AD-A167 622


DISTRIBUTED COMPUTING FOR SIGNAL PROCESSING: MODELING
OF ASYNCHRONOUS PAR.. (U) PURDUE UNIV LAFAYETTE IN
B W SMITH MAY 85 ARO-18798.17-EL-APP-G DAAG29-82-K-0101


UNCLASSIFIED




ON  THE  DESIGN  AND  MODELING 
OF  SPECIAL  PURPOSE  PARALLEL 
PROCESSING  SYSTEMS 


Ph.D.  Thesis  by: 

Bradley  Warren  Smith 

Faculty  Advisor: 

Howard  Jay  Siegel 


Appendix  G  for 

Distributed  Computing  for 
Signal  Processing: 
Modeling  of  Asynchronous 
Parallel  Computation 
Final  Report 


U.S.  Army  Research  Office 
Contract No. DAAG29-82-K-0101*


*Chapters 1, 4, and 5 supported by this contract.


This document has been approved
for public release and sale; its
distribution is unlimited.




ON  THE  DESIGN  AND  MODELING  OF  SPECIAL 


PURPOSE  PARALLEL  PROCESSING  SYSTEMS 

A  Thesis 

Submitted  to  the  Faculty 
of 

Purdue  University 
by 

Bradley  Warren  Smith 

In  Partial  Fulfillment  of  the 
Requirements  for  the  Degree 
of 

Doctor  of  Philosophy 

May 1985



The  author  wishes  to  extend  his  thanks  to  Dr.  James  Tilton  for  providing 
timing  data  about  the  11/70  and  for  patiently  dealing  with  the  author's 
ignorance  about  contextual  classification.  Special  thanks  is  also  extended  to 
Mark  A.  Yoder,  without  whom  the  discussion  on  the  isolated  word  recognition 
system  could  not  have  been  done  in  real-time  and  to  G.  Cooper,  E.  Coyle,  M. 
Franklin,  D.  Gannon,  G.J.  Lipovski,  D.  Meyer,  M.  Rodweli,  H.J.  Siegel,  P. 
Swain,  and  A.  Van  Tilburg  whose  careful  readings  and  comments  helped  to 
organize  and  clarify  the  ideas  presented  in  this  document.  Special  thanks  is 
extended  to  D.  Curry  and  K.  Rodweli  for  their  time  and  effort  in  the  final  rush 
in  producing  this  final  copy. 

Portions  of  the  research  presented  here  were  sponsored  by  the  National 
Aeronautics  and  Space  Administration,  under  Contract  No.  NAS729- 15-166, 
the  Air  Force  Office  of  Scientific  Research,  Air  Force  Systems  Command, 
USAF,  under  Grant  No.  AFOSR-78-353I,  the  Defense  Mapping  Agency, 
monitored  by  the  United  States  Air  Force  Command,  Rome  Air  Development 
Center,  under  Contract  No.  F30602-C-0193,  the  U.S.  Army  Research  Office, 
Department  of  the  Army,  under  Contract  DAAG29-82-K-0101,  and  by  the 
National  Science  Foundation  under  Grant  ECS-81-20896. 

Some  of  this  material  was/will  be  presented  in: 

H.J. Siegel, P.H. Swain, and B.W. Smith, “Remote sensing on PASM and
CDC flexible processors,” Multi-Computer Algorithms and Image Processing,
ed. K. Preston and L. Uhr, Academic Press, New York, NY, 1982.

B.W. Smith, H.J. Siegel, and P.H. Swain, “Contextual classification on a
CDC flexible processor system,” 1981 Machine Processing of Remotely
Sensed Data Symp., pp. 283-291, June 1981.

B.W. Smith, H.J. Siegel, and P.H. Swain, “Parallel processing concepts for
remote sensing applications,” 1982 Machine Processing of Remotely Sensed
Data Symp., Purdue Univ., pp. 520-526, June 1982.

B.W. Smith and H.J. Siegel, “Models for use in the design of macro-pipelined
parallel processors,” 12th Int'l. Computer Architecture Symp.,
June 1985.




TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1 INTRODUCTION

1.1. Overview
1.2. A Survey of Parallel Architectures for Image Processing
1.3. Hardware Taxonomies
1.4. SIMD Systems
    1.4.1. Three Bit-serial SIMD Systems
    1.4.2. STARAN, DAP, and CLIP4 -- Comparisons and Contrasts
1.5. MIMD Systems
    1.5.1. Cytocomputer -- A Bit-serial MIMD System
    1.5.2. PICAP II -- A Word Oriented MIMD Machine
1.6. Conclusions

CHAPTER 2 PARALLEL PROCESSING IMPLEMENTATIONS OF A CONTEXTUAL CLASSIFIER

2.1. Introduction
2.2. Contextual Classification
    2.2.1. Definitions
    2.2.2. Uniprocessor Algorithm
2.3. MIMD Implementation on the CDC Flexible Processor System
    2.3.1. Flexible Processor System
    2.3.2. Linear Contextual Classifiers
    2.3.3. Non-linear Contextual Classifiers
    2.3.4. Processing of Images with Large Numbers of Gp's
    2.3.5. Processing of Images in Bulk Memory
    2.3.6. A 16 FP System
    2.3.7. Processing of Large Images
    2.3.8. Summary
2.4. SIMD Implementations on PASM
    2.4.1. Introduction
    2.4.2. Overview of PASM
    2.4.3. Linear Contextual Classification on PASM
    2.4.4. Non-Linear Contextual Classification on PASM
2.5. Conclusions

CHAPTER 3 PARALLEL PROCESSING CONCEPTS FOR REMOTE SENSING APPLICATIONS

3.1. Introduction
3.2. Machine Architecture
3.3. Smoothing on a Parallel SIMD Machine
3.4. Maximum Likelihood Classification
3.5. Contextual Classification
3.6. Image Correlation on a Parallel Machine
3.7. The Fault Tolerance of MuRSS
3.8. An Enhanced MuRSS
3.9. MPP - A Massively Parallel Processor
3.10. Conclusions

CHAPTER 4 MODELS FOR USE IN THE DESIGN OF SPECIAL PURPOSE MACRO-PIPELINED PARALLEL PROCESSORS

4.1. Introduction
4.2. The Hardware Database
4.3. Response Time - Its Meanings and Interpretations
4.4. Parallelism, Task Division, and Design Scenario
4.5. Evaluation Categories - The Relationship Between the Layer and the Level
4.6. An Isolated Word Recognition System - Task Description
4.7. Application of Theory to Scenario
4.8. Conclusions

CHAPTER 5 ASYNCHRONOUS AND SYNCHRONOUS SYSTEMS ADVANTAGES AND DISADVANTAGES

5.1. Introduction
5.2. Determination of Pv(t) For a Single Input/Processing Stream
5.3. The Expected Queuesize of an Asynchronous System
5.4. A Comparison of Synchronous and Asynchronous Systems
    5.4.1. Introduction
    5.4.2. Initial System Models -- Three Potential Architectural Schemes
    5.4.3. Analysis of Synchronous Models with Two Probabilistic Models
    5.4.4. Analysis of an Asynchronous System -- Two Probabilistic Models
    5.4.5. Analysis of Systems Composed of Two Levels Whose Response Times are Random Variables
    5.4.6. Analysis of Q
    5.4.7. Double-buffering Versus Triple-buffering -- An Analysis
    5.4.8. Synchronous Systems Versus Asynchronous Systems
5.5. System Simulation -- Results
5.6. Conclusions

LIST OF REFERENCES

APPENDIX

Simulator Listings


LIST OF TABLES

Table

1.3.1 Kuck’s sixteen categories of computer architectures [SiS83]

5.4.3.1 Double buffered system (DB) and triple buffered system (TB): SRT and ST when t1 is fixed and t2 Gaussian random variable

5.4.3.2 SRT(DB), SRT(TB), and ST for t1 fixed and t2 uniform random variable

5.4.4.1 The expected queue size (Q), expected time spent in a queue (W), the expected system response time (SRT), and the expected system throughput (ST) for an asynchronous system with t1 fixed and t2 an arbitrary random variable

5.4.5.1 Statistics for a synchronous system with both t1 and t2 uniform random variables

5.4.5.2 Asynchronous system statistics (Gaussian)

5.4.6.1 Probability of overflow versus Q

5.5.1 Uniform system response times (200000 samples)

5.5.2 Gaussian system response times

5.5.3 Synchronous system response times when the response times of both level 1 and level 2


LIST OF FIGURES

Figure

The CLIP4 system configuration [Duf82]
Interconnection in CLIP arrays [Duf82]
The complete logic circuit for CLIP4 [Duf82]
DAP processing element [Red79]
Series E array module [Bat76]
Internal block diagram of a memory array [Thu76]
Pipeline image processor [Ste80]
Moving window implemented with shift register storage [Ste80]
PICAP II system architecture [KrD82]
Linear neighborhoods
Uniprocessor implementation of size three contextual classifier algorithm (p-2)
Pixels whose “compf” values are stored in “hold” array
Components of an FP ([CDC77a],[CDC77b])
A potential FP system configuration ([CDC77a],[CDC77b])
Striping method of dividing an I-by-J image among N FPs
2.3.2.2 Linear neighborhoods
2.3.3.1 Non-linear neighborhoods
2.3.6.1 A potential FP architecture for image processing
2.3.6.2 Data required for classification of non-linear windows
2.3.7.1 Processing an image that is too large for bulk memories
2.4.2.1 Block diagram of PASM [SiS81]
2.4.2.2 Architecture of the PCU [SiS81]
2.4.3.1 Modified striping scheme
2.4.4.1 Dividing an image using a “checkerboard” pattern. Each square represents one PE with a (I/√N)-by-(J/√N) subimage. The PE number is in the square
3.2.1 MuRSS system architecture
3.2.2 N+1 PU MuRSS system overview
3.2.3 MuRSS system architecture during normal operation
3.2.4 Bus structure of MuRSS PU
3.2.5 Organization of MuRSS memory
3.7.1 Minimum number of usable PUs in a 1024 PU MuRSS versus number of faulty PUs
3.8.1 Fault tolerant MuRSS system architecture
3.8.2 Double faults in EMuRSS leaving 7 usable PUs
3.8.3 Two modes of a bypass box
3.8.4 EMuRSS reconfiguration around two box faults on same shared bus
3.8.5 EMuRSS reconfiguration around faulty shared memory bus
3.8.6 EMuRSS reconfiguration around two bypass box faults associated with PU 1
3.8.7 EMuRSS reconfiguration around three adjacent bypass box faults
3.8.8 Ratio of usable PUs in EMuRSS to usable PUs in MuRSS
3.9.1 Block diagram of MPP ([Bat80],[Bat82])
3.9.2 MPP PE architecture ([Bat80],[Bat82])
3.9.3 Block diagram of the ACU ([Bat80],[Bat82])
4.1.1 Layering of isolated word recognition system
4.5.1 Buffering implementations
4.5.2 Scenario before and after application of techniques in [HuL82]
4.6.1 Isolated word recognition system
4.6.2 Layering of proposed scenario
4.6.3 Durbin's algorithm to compute LPC coefficients a_i from autocorrelation coefficients [Yod82]
4.6.4 Energy of typical utterance
4.6.5 Example of time warping [Yod82]
4.6.6 Adjustment window of width r [Yod82]
4.6.7 Sample DTW algorithm
Allowable architectures and feedback paths
Three architectures under consideration
Three orientations of t2 relative to t1
Queuelength as f(t2) c=0.1
Queuelength as f(t2) c=0.5
Queuelength as f(t2) c=1.0
Q as f(c) [b=0.75]


ABSTRACT 


Smith,  Bradley  Warren.  Ph.D.,  Purdue  University,  May  1985.  ON  THE 
DESIGN  AND  MODELING  OF  SPECIAL  PURPOSE  PARALLEL 
PROCESSING  SYSTEMS.  Major  Professor:  Howard  Jay  Siegel. 


As  the  capabilities  of  computing  machinery  grow,  so  does  the  diverse 
variety  of  their  applications.  The  feasibility  of  many  approaches  to  these 
applications  depends  solely  upon  the  existence  of  computing  machinery  capable 
of  performing  these  tasks  within  a  given  time  constraint.  Because  the  majority 
of  the  available  computing  machinery  is  general  purpose  in  nature,  tasks  that 
do  not  require  general  purpose  facilities,  but  that  do  require  high  throughput, 
are  condemned  to  execution  on  expensive  general  purpose  hardware. 

This  research  describes  several  tasks  that  require  fast  computing 
machinery.  These  tasks  do  not  require  general  purpose  facilities  in  the  sense 
that  the  computing  machinery  used  will  only  perform  a  fixed  set  of  tasks. 
Some  of  the  tasks  are  simple  in  nature,  but  are  required  to  execute  on  very 
large  data  sets.  Other  tasks  are  computationally  intensive  in  addition  to 
possibly  involving  large  data  sets.  Both  simple  and  complex  algorithms  are 
considered.  The  discussion  includes  a  description  of  the  tasks. 

All  of  the  above  tasks  are  useful;  however,  their  value  is  determined  in 
part  by  the  time  required  to  perform  them.  This  work  discusses  three 
architectures for performing remote sensing tasks. These architectures can
execute the described tasks more quickly than conventionally available
hardware.

The  discussion  extends  to  the  realm  of  designing  macro-pipelined 
distributed  computer  systems  for  special  purpose  applications.  Nine 
parameters  are  introduced  along  with  a  proposal  for  an  algorithmic  approach  to 
designing  a  computer  system  for  a  special  application.  The  parameters  are 
then  applied  to  an  isolated  word  recognition  system. 

For many tasks (especially those involving feedback), it is undesirable to
use  synchronous  parallelism.  A  study,  including  a  probabilistic  model,  of  the 
effects  of  using  asynchronous  stages  in  the  macro-pipeline  is  presented. 
Simulation  is  used  to  verify  the  results. 


CHAPTER  1 


INTRODUCTION 

1.1  Overview 

For  many  applications,  response  time  and  throughput  are  of  critical 
importance.  Such  applications  include:  defense  against  incoming  missiles, 
missile  guidance,  air  traffic  control,  weather  analysis,  speech  recognition,  and 
tomography.  The  principal  goal  is  to  process  the  data  in  “relevant”  time 
within  some  cost  criteria.  Further,  the  feasibility  of  performing  many  tasks 
depends  on  the  capability  to  execute  them  in  a  certain  amount  of  time  without 
excessive  hardware  expense. 

General  purpose  hardware,  while  less  expensive  than  special  purpose 
hardware,  is  typically  slower  than  hardware  designed  for  a  specific  task.  The 
design  of  special  computing  facilities  can  take  large  amounts  of  time  and 
manpower,  increasing  the  design  overhead  of  such  a  system  over  a  general 
purpose  system.  Since  special  purpose  computer  systems  typically  do  not  sell 
in  large  quantities,  the  design  cost  must  be  distributed  over  a  relatively  small 
number  of  units.  Thus,  the  cost  of  special  purpose  computer  systems  can  be 
considerably  greater  than  that  of  general  purpose  computer  systems.  The  high 
cost  of  special  purpose  hardware  decreases  the  desirability  of  algorithms  that 
require special purpose computer systems. Thus, accurate and powerful
algorithms may be passed over in favor of less accurate algorithms or, even
worse, nothing at all. To help reduce the cost of special purpose systems, computer
aided  tools  can  be  used  to  minimize  the  human  intervention  needed  in  the 
computer  design  process.  These  tools  would  reduce  the  design  time.  To 
achieve  this  goal,  tasks  must  be  modeled  as  to  the  type  of  computational 
resources  they  require.  Further,  presently  available  hardware,  such  as  small 
boards  and  chips,  must  be  modeled  according  to  their  computational 
capabilities.  By  extending  the  models  to  parallel  schemes,  combination  of  the 
two  models  allows  systems  to  be  proposed  or  built  to  perform  computationally 
intensive  tasks  within  some  time,  cost,  or  other  constraint. 

This  research  is  divided  into  four  chapters.  Chapter  2  considers  the 
application  of  parallelism  to  contextual  classifiers  for  image  analysis  which  are 
being  developed  to  exploit  the  spatial/spectral  content  of  a  picture  element 
(pixel)  to  achieve  higher  classification  accuracy.  Contextual  classification 
requires  large  amounts  of  computation,  so  special  hardware  is  of  value. 
Chapter 2 explores the CDC Flexible Processor (FP) system
([CDC77a],[CDC77b]) and the proposed multimicroprocessor system PASM
[SiS81], which are both parallel processing systems that can be applied to image
processing tasks. Timings for the FP system to perform contextual
classifications, based on a Purdue developed FP system simulator, are
presented. For comparison, the same algorithms have been run on a PDP-11/70.
The applicability of PASM for implementing the contextual classifier is
demonstrated by algorithm complexity analysis. The reduction in execution
time achieved through the use of these parallel systems is shown.

The  research  in  Chapter  2  has  suggested  a  specific  architecture  for  the 
application  of  parallel  processing  to  remote  sensing  tasks.  Chapter  3  proposes 
such an architecture. It is a large-scale multimicroprocessor structure which
could consist of as many as 1024 processors. This type of architecture is
extremely well suited to the execution of window and pixel based operations.
A number of remote sensing data processing techniques for implementation on
a machine with this architecture are discussed. Algorithms considered are:
image smoothing, image correlation, and contextual and non-contextual
methods of image analysis. This includes both the design of parallel algorithms
and the exploitation of appropriate data structures.

In addition to demonstrating how various algorithms can be performed on
the parallel architecture, Chapter 3 proposes extensions to the architecture to
increase  its  fault  tolerance.  Then,  a  specific  implementation  of  the 
architecture,  called  MuRSS,  is  contrasted  to  an  already  existing  system  called 
MPP.  MuRSS  and  MPP  are  compared  with  respect  to  speed,  processing 
capabilities,  and  fault  tolerance. 

In  Chapter  4,  an  approach  to  modeling  distributed  macro-pipelined 
computer  systems  is  examined.  This  chapter  uses  nine  parameters  to  form  a 
model  of  the  characteristics  of  parallel/distributed  algorithms  and  the 
environment  in  which  they  must  execute.  These  parameters  describe  the  I/O 
environment,  the  algorithm,  the  memory  requirements  of  the  algorithm,  and 
the  type  and  amount  of  arithmetic  calculations  required  by  the  algorithm  to 
process  a  normal  data  set. 

In  addition,  Chapter  4  uses  tuples  to  model  the  characteristics  of  computer 
architectures. These tuples describe the instruction set, the instruction
processing times, the size and speed of on-board cache, the data and address
widths, the replication of units, the number of stages in pipelined units, and
the functional overlap for each unit in the architecture. By combining the
tuples with the nine parameters, the execution time of the algorithm modeled
by the parameters on the hardware modeled by the tuples can be estimated.
The  combination  of  these  two  models  could  be  used  as  a  basis  for  computer 
aided  design  tools  used  for  special  purpose  parallel/distributed  processors.  This 
chapter  uses  a  layered  method  of  architecture  design,  in  which  a  task  is  broken 
down  into  sub-tasks.  Each  sub-task  is  then  assigned  to  a  special  purpose 
processing  unit.  Such  a  unit  may  be  either  a  traditional  serial  type  design  or  a 
parallel  design. 
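
As a rough illustration of how an algorithm model and a hardware model might be combined into a time estimate, the sketch below multiplies per-operation counts by per-instruction times. It is only a minimal sketch: the function name, the operation names, the counts, and the timings are all assumptions made for illustration, and they stand in for (rather than reproduce) the nine parameters and the hardware tuples defined in Chapter 4. Overlap, pipelining, and replication of units are ignored.

```python
# Hypothetical sketch: estimate serial execution time from an algorithm's
# operation counts and a hardware unit's per-instruction times.
# All names and numbers here are illustrative assumptions, not thesis data.

def estimate_time(op_counts, instr_times):
    """Return the estimated execution time in seconds (no overlap assumed)."""
    missing = set(op_counts) - set(instr_times)
    if missing:
        raise ValueError(f"hardware cannot execute: {sorted(missing)}")
    return sum(count * instr_times[op] for op, count in op_counts.items())

# Operations needed to process one data set (assumed values).
algorithm = {"add": 1000, "multiply": 500, "load": 2000}
# Seconds per instruction on one candidate processing unit (assumed values).
hardware = {"add": 1e-6, "multiply": 4e-6, "load": 2e-6}

total = estimate_time(algorithm, hardware)
print(f"estimated time: {total * 1e3:.1f} ms")
```

Comparing such estimates across several candidate units is the kind of step a computer aided design tool could automate.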

Chapter  5  extends  the  work  done  in  Chapter  4  by  looking  at  the  effects  of 
both  synchronous  and  asynchronous  stages  in  macro-pipelined  machines.  Two 
synchronous  schemes  (double  buffering  and  triple  buffering)  are  compared  to  an 
asynchronous  system  with  respect  to  throughput  and  system  response  time. 
Theoretical  results  are  presented.  A  simulator  to  calculate  the  throughput  and 
system  response  time  of  each  system  has  been  developed  to  verify  the  theory. 
The  results  of  the  simulation  of  over  200,000  data  sets  are  presented. 

1.2.  A  Survey  of  Parallel  Architectures  for  Image  Processing 

The purpose of the remainder of this chapter is to give background
information  pertinent  to  the  rest  of  this  work.  Two  taxonomies  or  hardware 
description  schemes  are  discussed  in  Section  1.3.  Sections  1.4  and  1.5  describe 
a  number  of  proposed  and  implemented  parallel  and/or  distributed  processing 
systems  that  can  be  used  for  image  processing.  The  systems  discussed  in  this 
chapter are: CLIP4 - the Cellular Logic Image Processor [Duf82, DuW73,
Fou81, Ger83]; Cytocomputer - a pipelined image processor [PrD79, Ste80];
DAP - the Distributed Array Processor [Ger83, Hun81, Red79]; the FP array
- CDC's Flexible Processor array [All82, SiS80, SiS82c, SmS81, SwS80]; MPP -
the Massively Parallel Processor [Bat80, Bat82, Ger83, Pot82a]; PASM - the
PArtitionable SIMD/MIMD system [SiM81a, SiS81, SiS82c, Sie81]; PICAP -
the Picture Array Processor [KrD82, KrG82, Gud81]; and STARAN -
Goodyear Aerospace's associative processor system [Bat74, Bat76, Bat77b,
Bat82, FeF74, Ger83, Thu76, Pot82b].

1.3.  Hardware  Taxonomies 

Currently, there are two classes of computer hardware taxonomies: those
that classify (e.g., tiger) and those that describe (e.g., four paws, 16 sharp
claws, a ravenous appetite for meat, etc.). The
classification taxonomies provide only the most general information, omitting
details for ease of use. Several descriptive taxonomies have been developed to
accurately describe the architecture of computer hardware. These descriptive
taxonomies are often so cumbersome that they cannot be used in
conversation.

One  of  the  first  taxonomies,  proposed  in  [Fly66],  is  a  classification 
taxonomy.  This  taxonomy  classifies  a  system  based  on  the  number  of 
concurrent  instruction  and  data  streams.  A  machine  has  either  a  single  stream 
or  multiple  streams  in  this  taxonomy. 

A machine that executes a Single Instruction stream on a Single Data
stream is called an SISD machine. Some examples of SISD machines are the
IBM 370/155, the DEC PDP-11/70, and the DEC VAX-11/780. Machines that
execute a Single Instruction stream on Multiple Data streams are called SIMD
machines. Some examples of SIMD machines are CLIP4, ILLIAC IV [Bar68,
Bou72, Sto80], MPP, PASM (in SIMD mode), PICAP I, and STARAN. In such
systems, a control unit broadcasts the same instruction to all processors, and
all enabled processors execute the same instruction simultaneously, each
processor on its own data stream. Machines that execute Multiple Instruction
streams on Multiple Data streams are called MIMD machines. Some
examples of MIMD machines include the CDC Flexible Processor Array, PASM
(in MIMD mode), PICAP II, and Cytocomputer [LoM80]. A machine that
executes Multiple Instruction streams on a Single Data stream is called an MISD
machine. Macro-pipelined machines fall into this category. The design of such
machines is the topic of discussion for Chapter 4.
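
The four classes above follow mechanically from the two stream counts. The short sketch below makes that explicit; the function name `flynn_class` and its integer arguments are illustrative assumptions, not notation from [Fly66].

```python
# Minimal sketch of Flynn's classification: a machine is placed in a class
# solely by whether it has one or many instruction and data streams.
# The function name and argument convention are assumptions for illustration.

def flynn_class(instruction_streams, data_streams):
    """Return 'SISD', 'SIMD', 'MISD', or 'MIMD' for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"

print(flynn_class(1, 1))    # a uniprocessor such as the PDP-11/70
print(flynn_class(1, 64))   # one control unit broadcasting to 64 PEs
print(flynn_class(16, 16))  # 16 independent processors
print(flynn_class(4, 1))    # a four-stage macro-pipeline on one data stream
```

Note that the function discards the actual counts, which is exactly the coarseness of the taxonomy: a 16-processor and a 1024-processor machine receive the same label.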

The  classes  of  machines  in  this  taxonomy  are  very  broad.  For  example, 
MPP, whose Processing Elements (PEs) operate on one bit of data at a time,
falls into the same class (SIMD) as ILLIAC IV, whose PEs operate on 64 bits of
data  simultaneously.  In  addition,  this  taxonomy  gives  no  indication  of  the 
relative  size  of  a  machine.  For  example,  PASM  (in  MIMD  mode),  which  could 
consist  of  up  to  1024  PEs,  is  in  the  same  class  as  the  CDC  FP  array,  which  can 
consist  of  up  to  16  PEs.  Several  taxonomies  have  been  proposed  to  narrow  the 
classes,  at  the  expense  of  simplicity.  Flynn's  taxonomy,  however,  still  remains 
the  simplest  and  most  widely  used. 

In  contrast  to  Flynn’s  taxonomy,  which  categorizes  computers  according 
to  their  instruction  and  data  streams,  the  classification  taxonomy  in  [Kuc78] 
proposes  to  classify  hardware  according  to  the  instruction  stream(s),  instruction 
type,  execution  stream(s),  and  execution  type.  As  in  Flynn’s  taxonomy,  the 
instruction  and  execution  streams  can  be  either  single  or  multiple.  The 
instruction  and  execution  types  can  be  either  scalar  or  array. 

The number of instruction streams is determined by the number of
concurrently executable programs. For a program to be executable, it requires
a program location counter to point to the next instruction to be executed.

If  the  arguments  to  any  machine  language  instruction  (operands)  are 
arrays,  the  instruction  type  is  array.  If  no  machine  language  instruction  can 
accept  an  array  (vector)  as  an  argument,  the  instruction  type  is  scalar.  For 
example,  consider  the  instruction: 

move  a,m 

If “a” is a single element and “m” is a memory location, this instruction type is
scalar. Systems that have scalar type instructions include: the AMD 9511A
[Amd82], the CDC FP array [CDC77a, All82], the CDC 6600 [Che80], CLIP4
[Duf82, Fou81], ILLIAC IV [Bar68, Bou72], MPP [Bat80], PASM [SiS81, Sie82],
and STARAN [Bat76, Bat77b]. For the instruction:

move a,m,1000

if “a” is the base address of an array, “m” is a memory location, and 1000 is
the number of bytes to be moved, then the instruction is implicitly performing
an array operation. For this latter case, the instruction type is array. For a
system to have array type instructions, it must include at least one array
instruction. Systems that have array type instructions are: OMEN [Thu76],
VAMP [Che80, Thu76], and the TI-ASC [Che80]. An example of a chip that
has an array type instruction is the Zilog-Z80 [SiS83].

The number of execution streams is determined by the variety of
operations that can be performed simultaneously by the system. A
system can perform either a single operation or multiple operations at once. Multiple
copies of a single operation count as a single operation. Systems that fall into
the single execution stream category are all systems in the SISD and SIMD
classes of Flynn’s taxonomy that allow no overlapping of different instructions
(e.g., no overlap of control unit and PE operations). An example of a machine
that has a single instruction stream of scalars with a multiple execution stream
is the CDC 6600. The CDC 6600 has two multipliers, the execution of which
can be overlapped with the addition unit. From a single job stream, both an
addition and a multiplication can be taking place at the same time, although
they cannot be initiated simultaneously; thus, there exist multiple execution
streams. Another example of a machine that has a single instruction stream of
scalars with a multiple execution stream is the VAX-11/780 with the floating
point accelerator. A VAX-11/780 can overlap slower floating point operations
with integer instructions, giving multiple executions simultaneously. Without
the floating point accelerator, the VAX cannot overlap operations in any way,
so the system must wait for the result of any operation before continuing.
Thus, the VAX without the floating point accelerator is an example of a system
that has a single instruction and single execution stream.

The  execution  type  is  either  scalar  or  array  and  is  determined  by  the 
number  of  operands  to  which  a  machine  language  instruction  can  be  applied 
simultaneously. A system where a single machine language instruction operates
on multiple operands, like the ILLIAC IV SIMD machine, which issues scalar
instructions that act upon 64 operands, is said to have an array execution type.
If  no  machine  language  instruction  can  act  on  multiple  operands 
simultaneously,  the  execution  type  is  scalar. 

The nomenclature is formed by describing the instruction stream and type
with the execution stream and type. Systems such as the PDP-11/70, which
have a Single Instruction stream that performs Scalar instructions on a Single
Execution stream of Scalars, are classified as SISSES. ILLIAC IV, which has
scalar type instructions fetched by one control unit and broadcast to 64
execution units, is classified as SISSEA (assuming no instructional overlap is
allowed). The CDC 6600 has a single instruction stream of scalar instructions
that control a multiple execution stream of scalars and is classified as
SISMES. The TI-ASC, which has a single instruction stream of array instructions
that controls a multiple execution stream of array operations, is classified as
SIAMEA. Table 1.3.1 [SiS83] shows what machines fall into which classes.
Kuck's scheme is a more precise classification taxonomy; however, it is also
more cumbersome to use.
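
The six-letter names can be assembled directly from the four attributes. The sketch below reproduces the four classifications given in the text; the function name `kuck_class` and the string arguments are illustrative assumptions rather than notation from [Kuc78].

```python
# Minimal sketch of how a six-letter class name in Kuck's scheme is formed:
# [S|M] I [S|A] [S|M] E [S|A]
# (instruction streams, Instruction, instruction type,
#  execution streams, Execution, execution type).
# The function name and string arguments are assumptions for illustration.

def kuck_class(instr_streams, instr_type, exec_streams, exec_type):
    """Build the class name, e.g. 'SISSES', from the four attributes."""
    streams = {"single": "S", "multiple": "M"}
    types = {"scalar": "S", "array": "A"}
    return (streams[instr_streams] + "I" + types[instr_type]
            + streams[exec_streams] + "E" + types[exec_type])

# Attribute values taken from the classifications stated in the text.
examples = {
    "PDP-11/70": ("single", "scalar", "single", "scalar"),
    "ILLIAC IV": ("single", "scalar", "single", "array"),
    "CDC 6600": ("single", "scalar", "multiple", "scalar"),
    "TI-ASC": ("single", "array", "multiple", "array"),
}
for machine, attrs in examples.items():
    print(machine, kuck_class(*attrs))
```

Two choices for each of four attributes give the sixteen categories of the scheme.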

The  descriptive  taxonomy  set  forth  in  [HoJ81]  describes  the  architecture  of 
a  machine  in  an  algebraic  style  suitable  to  printing  and  entry  into  a  computer. 
A  SISD  computer  in  this  notation  would  be  described  as: 

C=I[E-M] 


This means that the computer (C) is composed of a single instruction unit (I)
controlling an execution unit (E) and a memory unit (M). There are twenty
rules that govern symbols, their use, and how they are connected. A synopsis of
this notation appears in both [HoJ81] and [SiS83].

Other descriptive taxonomies are set forth in [Gil83] and [BeN71]. These
notations, while similar to the notation set forth in [HoJ81], have one
important conceptual difference. The notation in [HoJ81] is specifically two
dimensional, i.e., the architecture of the system can be described in a two
dimensional manner. The notations in [Gil83] and [BeN71] are three
dimensional in nature, making them very difficult to parse. A discussion of each
of the taxonomies appears in [SiS83], along with several examples. In general,
Flynn's  hardware  classification  scheme  will  be  used  here.  A  special  descriptive 


--■>  V- 


10 


Table 1.3.1

Kuck's sixteen categories of computer architectures [SiS83].

                         SINGLE EXECUTION                 MULTIPLE EXECUTION
INSTRUCTION  TYPE     SCALAR         ARRAY             SCALAR           ARRAY
-----------  ------   ------------   ---------------   --------------   ---------
SINGLE       SCALAR   PDP 11/45      ILLIAC IV         CDC 6600 CPU     OMEN-60
                                     STARAN
                                     (PASM)
                                     (TRAC)

SINGLE       ARRAY    ZILOG Z80      CYBER 203/205     NONE KNOWN       CRAY-1
                                                                        BSP
                                                                        CDC 7600
                                                                        TI ASC

MULTIPLE     SCALAR   CDC 6600 PPU   NONE KNOWN        BURROUGHS FMP    PASM
                                                       DATA FLOW        (TRAC)
                                                       (PASM)
                                                       (TRAC)
                                                       DENELCOR HEP

MULTIPLE     ARRAY    UNDESIRABLE    NONE KNOWN        NONE KNOWN       PEPE
                      DESIGN                                            CDC NASF
                                                                        TRAC
                                                                        PUMPS

taxonomy  is  needed  and  is  proposed  in  Chapter  4.  There,  computer  hardware 
needs  to  be  described  by  its  capacity  and  speed  of  execution  in  such  a  manner 
that  timing  information  can  be  simply  obtained. 

For the application in Chapter 4, Flynn's taxonomy does not provide
enough information about system architecture to be of use. The taxonomy in
[Gil83] limits the level of description of a system in addition to not specifically
stating how a system's resources are to be connected. A more explicit
representation of the overall system architecture can be found in [BeN71];
however, this description is two dimensional. Thus it is inconvenient to store
in a computer, and quite difficult to analyze. Finally, it is undesirable to apply
the taxonomy set forth in [HoJ81] because the depth of the description is
arbitrary. Therefore, different people can describe the same machine differently.
Thus, while all of the above taxonomies are of importance, none is directly
applicable to the application in Chapter 4.

1.4.  SIMD  Systems 

The SIMD systems discussed in this work fall into the following two
categories. Bit-serial systems are composed of PEs that can process only a
single bit at a time. Bit-parallel systems are composed of PEs that process
multiple bits at once. Such PEs are said to process words. CLIP4 [Duf82],
DAP [Red79], MPP [Bat80, Bat82], and STARAN [Bat74, Bat76, Bat77b,
Pot82] are all bit-serial systems. ILLIAC IV [Bar68, Bou72], MuRSS [SmS82],
and PASM [SiS81] are all bit-parallel or word organized systems. All of the
systems, except PASM, are purely SIMD machines. PASM, however, can be
either SIMD or MIMD as needed.


Section 1.4.1 will discuss DAP, CLIP4, and STARAN. The strengths and
weaknesses of DAP, CLIP4, and STARAN are presented in Section 1.4.2.
ILLIAC IV and its applications have been extensively discussed in [Bar68,
Bou72, HoS82, Sto80, Thu76]. PASM is described in Chapter 2. Both MuRSS
and MPP are presented in detail in Chapter 3. For brevity, a discussion of
ILLIAC IV, PASM, MuRSS, and MPP is omitted here.

1.4.1.  Three  Bit-serial  SIMD  Systems 

The Cellular Logic Image Processor (CLIP) series of processors was first
completed in 1971. Since that time, five variations on the original machine
have been built. Most recently, CLIP4, a 96-by-96 processor array, designed to
process video input from a TV camera, was completed. The organization of
the CLIP4 system is shown in Fig. 1.4.1.1 [Duf82]. Each PE has 32 bits of
memory associated with it. The incoming video image is digitized into 6-bit
quantities which are then processed bit-serially (as six bit-planes) by the
96-by-96 array of PEs. To control the array, extract instructions, and coordinate
the peripherals associated with the array, a controller is provided. A
PDP-11/10 acts as host for the system.

A PE in CLIP4 can communicate with either its eight nearest neighbors or
its six nearest neighbors depending on which communication mode is selected.
These two modes are shown in Fig. 1.4.1.2 [Duf82]. The internal organization
of a PE is shown in Fig. 1.4.1.3 [Duf82]. The Boolean processor can perform all
Boolean operations on single-bit inputs. Addition (subtraction) can be done by
performing the logical operations to generate the sum (difference) and then
generating the carry (borrow). Carries (borrows) are then routed through the
gating array for use in calculating the next bit.

Fig. 1.4.1.1 The CLIP4 system configuration [Duf82]

Fig. 1.4.1.3 The complete logic circuit for CLIP4 [Duf82]

In conclusion, CLIP4 can perform picture element (pixel) independent
operations, i.e., operations where each pixel is treated independently of its
surrounding pixels, as well as many nearest neighbor operations. CLIP4 is
capable of performing a variety of image processing tasks in real-time.
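
The bit-serial addition described above for CLIP4 (form the sum bit with
logical operations, then form the carry for the next bit) can be sketched as
follows. This is a minimal software model under stated assumptions, not the
CLIP4 circuit of [Duf82]: operands are assumed to be unsigned and stored as
bit planes, least significant plane first, and the helper names
bit_serial_add, to_planes, and from_planes are hypothetical.

```python
def bit_serial_add(a_planes, b_planes):
    """Add two arrays of unsigned integers stored as bit planes.

    a_planes[b][p] is bit b (least significant first) of PE p's operand.
    One extra plane is returned so that the final carry is kept.
    """
    n_pes = len(a_planes[0])
    carry = [0] * n_pes
    out = []
    for a_bits, b_bits in zip(a_planes, b_planes):
        # Every PE applies the same Boolean operations to its own single bits.
        out.append([a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carry)])
        carry = [(a & b) | (c & (a ^ b))
                 for a, b, c in zip(a_bits, b_bits, carry)]
    out.append(carry)
    return out


def to_planes(values, n_bits):
    """Split integers into n_bits bit planes, least significant plane first."""
    return [[(v >> b) & 1 for v in values] for b in range(n_bits)]


def from_planes(planes):
    """Reassemble integers from bit planes."""
    return [sum(bit << b for b, bit in enumerate(column))
            for column in zip(*planes)]


a = [5, 63, 0, 21]
b = [9, 1, 0, 42]
total = from_planes(bit_serial_add(to_planes(a, 6), to_planes(b, 6)))
```

The number of passes through the loop equals the operand precision, which is
why bit-serial PEs can trade execution time for precision freely.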

To process computationally intensive tasks, the Distributed Array
Processor (DAP) project was started at ICL in 1972. The result of this project
was a 64-by-64 array of PEs called the ICL DAP. Unlike CLIP4, DAP is
4-connected. This corresponds to a subset of the eight nearest-neighbor
interconnection function presented in Fig. 1.4.1.2 consisting of connections 2, 4,
6, and 8. The architecture of the PE is shown in Fig. 1.4.1.4 [Red79]. The ALU
in a DAP PE is very simple. Many logical functions must be broken down into
sequences of AND and NOT operations.

Instead of having 32 bits of memory associated with each PE, like CLIP4,
the DAP PEs each have 4K bits of RAM. All input and
output to DAP is done through the host's memory, i.e., the DAP memories are
a portion of the host's memory. This has the advantage that it eliminates idle
transfer time, but it requires the DAP to be used in conjunction with an ICL
2900 series mainframe, which is expensive (cost: $1,000,000 and up) [Ger83].
A detailed comparison and contrast of DAP, CLIP4, and MPP appears in
[Ger83].

STARAN [Bat74, Bat76] is a bit-serial system that differs greatly from
CLIP4 and DAP. The original STARAN is composed of 256 PEs, a 256-by-256
bit Multi-Dimensional Access (MDA) memory, and an interconnection
network. The MDA memory can be accessed by bit-slices, byte-slices, words,
or by other portions. In STARAN-E [Bat77b], the MDA memory is composed
of up to 256 256-by-256 bit planes of memory. STARAN-E is shown in Fig.
1.4.1.5 [Bat77b]. Instead of having the nearest-neighbor interconnections, like
CLIP4 and DAP, STARAN is equipped with a multistage permutation network
called the flip network. This is a multistage cube type of network [Sie85]. Its
capabilities are discussed in [Bat76].

Fig. 1.4.1.6 [Thu76] shows the layout of the STARAN memory array.
Two registers (X and Y) represent 256 1-bit PEs. The logic associated with
the X- and Y-registers can perform any of the sixteen Boolean functions of two
variables. Inputs for the two variable Boolean functions are the present state
of the register and the input from the permutation network, which can either
be memory or the output of another PE. In addition, for PE i, either $X_i$ or $Y_i$
may be used as a mask for an operation on the other register, e.g.,
$X_i \leftarrow f(X_i, \mathrm{network}_i)$ if $Y_i = 1$ $(i = 0, 1, \ldots, 255)$. The status of the mask determines
which memory locations are modified for a masked write operation. Addition
on STARAN is demonstrated in [Bat74].
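
The masked register update described above can be modeled in software as
follows. This is a sketch only: encoding a Boolean function as a 4-bit truth
table is an assumption made for illustration (it is not the STARAN instruction
format), and the names boolean_function and masked_update are hypothetical.

```python
def boolean_function(code):
    """Return one of the 16 Boolean functions of two one-bit variables.

    `code` is a 4-bit truth table: bit (2*x + n) of `code` is f(x, n).
    (Assumed encoding, chosen only for this illustration.)
    """
    return lambda x, n: (code >> (2 * x + n)) & 1


def masked_update(x, y, network, code):
    """X_i <- f(X_i, network_i) where Y_i = 1; other X_i are unchanged."""
    f = boolean_function(code)
    return [f(xi, ni) if yi == 1 else xi
            for xi, yi, ni in zip(x, y, network)]


AND, XOR, OR = 0b1000, 0b0110, 0b1110

x = [1, 1, 0, 0]          # X register, one bit per PE
net = [1, 0, 1, 0]        # bits delivered by the permutation network
y = [1, 0, 1, 1]          # Y register used as the mask
x_xor = masked_update(x, y, net, XOR)
x_and = masked_update(x, [1, 1, 1, 1], net, AND)
```

In x_xor, PE 1 keeps its old value because its mask bit is 0; all other PEs
take the exclusive OR of their X bit and the network input.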

STARAN was designed to be connected to a variety of host computers as
a special purpose peripheral. Three systems cited in [Bat74] are: a DEC
PDP-11, a Honeywell HIS-645, and an XDS Σ 5. The applications of STARAN
to fast Fourier transformation, sonar post-processing, and air traffic control are
all presented in [Bat74]. The application of STARAN to pattern processing is
discussed in [Pot82].

Fig. 1.4.1.6 Internal block diagram of a memory array [Thu76]

1.4.2. STARAN, DAP, and CLIP4 — Comparisons and Contrasts

The design of STARAN is vastly different from those of DAP and CLIP4.
DAP and CLIP4 have simple nearest-neighbor inter-processor connections.
STARAN's permutation network is more costly. For simple operations on
binary arrays, such as erosion and dilation, the DAP and CLIP4
interconnection patterns are simple to use. However, on operations such as
FFTs, STARAN can use the permutation network for performing the butterfly
operations; this is not feasible using DAP and CLIP4.

CLIP4 processors can address a small amount of memory (32 bits each),
DAP processors can each address 4K bits of memory, and STARAN processors
share one common memory store (some number of 256-by-256 bit planes).
Thus, DAP and CLIP4 spend no time fetching and storing operands and
temporary results from a global memory, except for initial loading and final
unloading. Both STARAN and STARAN-E with bipolar memory have
circumvented the problem of a global memory becoming a system bottleneck
by using memory that is faster than the registers on either DAP or CLIP4 and
that is as fast as the PE registers on STARAN. In addition, memory is
accessed in such a way that there is no network contention [Bat77a]. Thus,
there is no penalty for having the remote memory. The advantage of the
scheme used for STARAN is that permuting data through the network
does not involve PE operations. For example, to transmit data in PE i's
memory to PE i+1's memory requires only a reconfiguration of the network. For
both CLIP4 and DAP, this same operation would require a read from memory,
a store in the network register, a read from the network register, and a store in
local memory. Clearly, the scheme used for STARAN is less cumbersome and
less time consuming.


The bit-serial nature of the PEs allows a great deal of flexibility in
precision and representation of data. The PEs composing DAP are limited to
the Boolean AND and NOT operations, making operations such as addition
complex entities. The CLIP4 processor is capable of performing the Boolean
AND, OR, and EXCLUSIVE OR operations; however, the architecture of the
PEs facilitates addition. STARAN PEs are capable of performing Boolean
AND, OR, NOT, TRUE, FALSE, and EXCLUSIVE OR. In addition,
STARAN PEs can perform these operations with up to three arguments (the
X-register, the Y-register, and input from the MDA), making a wide variety of
operations possible.

CLIP4 PEs have a small amount of associated memory, increasing control
unit overhead for tasks that require more than 32 bits of associated memory for
parameters and constants. DAP PEs have a larger available memory (4K bits).
STARAN-E avoids this problem with its 256 256-by-256 bit planes of memory.

Because  of  the  organization  of  all  three  arrays,  the  method  of  calculating  a 
function  of  a  few  variables  and  using  the  result  to  index  into  a  table  of  entries 
is  extremely  difficult,  as  the  result  of  the  calculation  must  be  globally 
transmitted  by  the  Control  Unit  to  each  PE.  According  to  [Ger83],  this  process 
may  be  faster  in  a  sequential  machine.  This  is,  however,  a  fault  with  bit-serial 
processing,  not  these  architectures. 

In  conclusion,  three  bit-serial  SIMD  architectures  have  been  introduced 
and  discussed.  The  bit-serial  architecture  lends  itself  well  to  a  wide  variety  of 
processing  tasks  and  data  precisions.  Bit-serial  processing  makes  operations  on 
words  (such  as  floating  point  addition)  more  difficult  because  the  operands  are 
processed  one  bit  at  a  time. 


1.5.  MIMD  Systems 


SIMD  systems  provide  an  environment  where  every  PE  performs  the  same 
operations  at  the  same  time.  Conditional  operations,  such  as:  if  (condition) 
then  {  A  }  else  {  B  }  require  all  PEs  not  satisfying  the  condition  to  be  idled 
while  the  remaining  PEs  execute  the  block  of  code  corresponding  to  “A.”  Upon 
their  completion,  the  active  PEs  are  idled  while  the  remaining  PEs  execute  the 
block  of  code  corresponding  to  “B.”  The  idling  of  PEs  reduces  the  potential 
gains  in  the  throughput  that  the  system  can  give.  For  some  tasks,  SIMD 
systems  may  not  give  desirable  performance.  MIMD  systems  may,  for  these 
tasks,  give  an  increased  throughput  over  SIMD  systems.  The  added  flexibility 
of  MIMD  systems  comes  with  an  increased  cost  of  overhead  to  perform 
synchronization when it is necessary. There are certain problems that are not
appropriate  to  the  single  instruction  stream  limitations  of  SIMD  machines, 
justifying  the  extra  cost  of  MIMD  processing. 
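
The idling behaviour of an SIMD machine on a conditional can be illustrated
with a small software model. This is a sketch under simplifying assumptions:
each PE holds one data item, blocks "A" and "B" each cost one step, and the
function name simd_if_else is hypothetical.

```python
def simd_if_else(data, condition, op_a, op_b):
    """Apply op_a where condition holds and op_b elsewhere, SIMD style.

    Returns the results and the number of PE-steps spent idle.
    """
    mask = [condition(v) for v in data]      # each PE evaluates the condition
    result = list(data)
    idle = 0
    # Pass 1: block "A" is broadcast; PEs whose mask is False are idled.
    for pe, enabled in enumerate(mask):
        if enabled:
            result[pe] = op_a(data[pe])
        else:
            idle += 1
    # Pass 2: block "B" is broadcast; PEs whose mask is True are idled.
    for pe, enabled in enumerate(mask):
        if not enabled:
            result[pe] = op_b(data[pe])
        else:
            idle += 1
    return result, idle


values, idle_steps = simd_if_else(
    [3, -1, 4, -5],
    condition=lambda v: v >= 0,
    op_a=lambda v: v * 2,
    op_b=lambda v: -v,
)
```

Every PE is idle during exactly one of the two passes, so in this model half
of the available PE-steps are lost regardless of how the condition splits the
data. An MIMD machine would let each PE run only the branch it needs.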

The architecture of a bit-serial MIMD system, Cytocomputer, will be
discussed in Section 1.5.1. A word-oriented system, PICAP II, will be discussed
in Section 1.5.2. Two more word-oriented systems are discussed later. The CDC
FP array and the proposed system PASM are presented in detail in Chapter
2.


1.5.1.  Cytocomputer  —  A  Bit-serial  MIMD  System 

Cytocomputer was developed at the Environmental Research Institute of Michigan
(ERIM) to perform window or cell based image processing operations. Its name
comes from the Greek word “cyto,” meaning cell [Ste80, Lom80]. The concept
of a cell accurately describes the architecture of the Cytocomputer. With DAP
and CLIP4, there is one PE per pixel. An interconnection network is required
for window based operations. Cytocomputer uses internal storage in the PEs
to achieve the nearest-neighbor connectivity. One PE performs a given
operation for the entirety of an image, greatly reducing the number of PEs
required. This significantly reduces the complexity, cost, and speed of
Cytocomputer relative to CLIP4 and DAP.

The architecture of Cytocomputer is simple and is shown in Fig. 1.5.1.1
[Ste80, Lom80]. Cytocomputer consists of K (presently 80) identical stages in a
pipeline. Each of the stages is a fully table-driven cellular logic machine
capable of performing operations involving either four, six, or eight
nearest-neighbors. In addition, each stage has a point-by-point logic function, which is
capable of performing non-neighborhood operations, such as thresholding.

The nearest neighbor connectivity is achieved by loading data from the
input stream (or previous stage) into a shift-register, as shown in Fig. 1.5.1.1.
Only nine elements, arranged in a three-by-three square, in the shift register
are accessible at one time. This defines the neighborhood function. To be
consistent with [Ste80], let N be the number of elements in a row of an image.
Thus, to store the necessary amount of information to process a three-by-three
window, 2N+3 pixels must be stored by each stage or PE. Windows are
achieved as shown in Fig. 1.5.1.2 [Ste80]. Results of calculations are passed on
to the next stage for further processing. After their last use, the input data to
each stage are discarded.
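
The way a shift register of 2N+3 pixels exposes a three-by-three window can be
sketched as follows. This is a software model of the idea only, assuming the
image arrives in raster (row-major) order; the function name windows_3x3 and
the tap layout are illustrative assumptions rather than details taken from
[Ste80].

```python
from collections import deque


def windows_3x3(pixels, n_cols):
    """Yield (row, col, window) for each interior pixel of a raster stream.

    The deque models one stage's shift register of 2*N + 3 pixels.
    """
    length = 2 * n_cols + 3
    reg = deque(maxlen=length)
    # Register positions of the nine accessible elements (three per image row).
    taps = [r * n_cols + c for r in range(3) for c in range(3)]
    for t, pixel in enumerate(pixels):
        reg.append(pixel)                  # oldest pixel is discarded when full
        if len(reg) < length:
            continue
        center = t - (n_cols + 1)          # raster index of the window center
        row, col = divmod(center, n_cols)
        if 1 <= col <= n_cols - 2:         # skip windows that wrap a row edge
            w = [reg[i] for i in taps]
            yield row, col, [w[0:3], w[3:6], w[6:9]]


# A 4-by-4 test image whose pixel values equal their raster index.
found = list(windows_3x3(range(16), 4))
```

For the 4-by-4 example, the first window appears once pixel 10 has been shifted
in, i.e., after 2N+3 = 11 pixels, and it is centered on pixel (1,1).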

Each of the PEs is driven by a common clock and is capable of performing
independent cell (window) operations. For each of the 80 stages, the time for a
pixel operation is 640 ns. Further increases in throughput are possible by
adding stages to the pipeline. The present speed of Cytocomputer
allows it to perform many applications at a real-time rate. Applications of
Cytocomputer to image processing tasks are discussed in [Ste80, PrD79,
Lom80].

1.5.2. PICAP II — A Word Oriented MIMD Machine

The Picture Array Processor (PICAP) was developed at Linköping
University in 1972. It is an MIMD system with up to sixteen word oriented
processors connected to a shared picture memory through a time-shared high
speed bus. The “word-size” each processor operates on is a 64-by-64 window of
4-bit integers. The architecture of PICAP II is shown in Fig. 1.5.2.1 [KrD82].
PICAP's picture memory consists of 4 Mbytes of interleaved RAM, which is
sequentially addressable. With this architecture, PICAP is capable of
processing multiple images simultaneously with little overhead.

Tasks  that  are  too  large  for  a  single  PICAP  processor  can  be  subdivided 
and  placed  on  different  processors.  This  offers  a  great  deal  of  flexibility  when 
applying  PICAP  to  large  image  processing  tasks. 

For PICAP II, the shared bus is capable of transmitting $4 \times 10^7$ pixels per
second, 40 times greater than that of its SIMD predecessor, PICAP I. The host
computer is a PDP-11 series computer that is also used to oversee the operation
of the system. PICAP has a real-time video input and monitor, which allows
interactive processing of image data. Pictures are interactively processed
on PICAP through a structured high-level language called Picture Processing
Language (PPL) [KrD82], which allows interactive processing, loading, and
display of images. A FORTRAN interface is also available.

Fig. 1.5.2.1 PICAP II system architecture [KrD82]

A discussion of both PICAP I and PICAP II can be found in [KrD82].
Applications of PICAP II to image processing tasks can be found in [KrG82].


1.6.  Conclusions 

Several SIMD and MIMD systems for image processing were discussed.
Both  word-oriented  and  bit-serial  architectures  were  presented.  General 
descriptions  and  applications  of  a  wide  variety  of  processors  for  image 
processing  may  be  found  in  the  following  books:  [Duf83],  [DuL81],  [Ful82],  and 
[PrU82]. 


CHAPTER  2 

PARALLEL  PROCESSING  IMPLEMENTATIONS 
OF  A  CONTEXTUAL  CLASSIFIER 


2.1. Introduction

Multispectral  image  data  collected  by  remote  sensing  devices  aboard 
aircraft  and  spacecraft  are  relatively  complex  data  entities.  Both  the  spatial 
attributes  and  spectral  attributes  of  these  data  are  known  to  be  information 
bearing  [SwD78],  but  to  reduce  the  computation  involved,  most  analysis  efforts 
have  focused  on  one  or  the  other.  Characteristic  spatial  features  include,  for 
example,  shape,  texture,  and  structural  relationships.  Useful  research  has  been 
accomplished  in  the  direction  of  incorporating  spatial  information  into  the  data 
analysis  process  (e.g.,  [HaS73],  [KeL76],  [WeD76]). 

The  “class”  associated  with  a  given  pixel  is  not  independent  of  the  classes 
of  adjacent  pixels.  Stated  in  terms  of  a  statistical  classification  framework, 
there  may  be  a  better  chance  of  correctly  classifying  a  given  pixel,  if  in 
addition  to  the  spectral  measurements  associated  with  the  pixel  itself,  the 
measurements  and/or  classifications  of  its  “neighbors”  are  considered  as  well. 
The image can be considered to be a two-dimensional random process
incorporated into the classification strategy. This is the objective of
“contextual classifiers” [WeS71], in which a form of compound decision theory
is employed through the use of a statistical characterization of context. Recent
investigations have demonstrated the effectiveness of a contextual classifier that
combines spatial and spectral information by exploiting the tendency of certain
ground-cover classes to occur more frequently in some spatial contexts than in
others [SwS80], [SwV81], [TiS81], [WeS71].

The  practical  utilization  of  this  contextual  classifier  in  remote  sensing  has 
awaited  the  solution  of  two  key  problems:  (1)  lack  of  an  effective  method  for 
characterizing  and  extracting  contextual  information  in  multispectral  remote 
sensing  imagery,  and  (2)  the  need  to  reduce  the  execution  time  of  the  very 
computation-intensive  contextual  classification  algorithm.  The  first  of  these 
problems  has  been  solved  by  development  of  an  unbiased  estimation  procedure 
which  provides  a  good  characterization  of  the  contextual  information  without 
requiring exorbitant amounts of classifier training data (“ground truth”)
[TiS81].  Although  the  resulting  improvement  in  classification  accuracy  is 
significant  compared  to  conventional  no-context  statistical  classification 
methods,  the  practicality  of  the  contextual  classifier  depends  on  the  solution  of 
the  second  problem,  the  subject  of  this  chapter. 

A reduction in the execution time of classification algorithms such as the
contextual classifier (and even much simpler algorithms used for remote sensing
data analysis) can be achieved through the use of parallelism. There are several
types of parallel processing systems. An SIMD (Single Instruction stream --
Multiple Data stream) machine [Fly66] typically consists of a control unit, N
processors, N memory modules, and an interconnection network. The control
unit broadcasts instructions to all of the processors, and all active (enabled)
processors execute the same instruction at the same time. Each active processor
executes the instruction on data in its own associated memory module. The
interconnection network provides a communications facility for the processors
and memory modules. An MIMD (Multiple Instruction stream -- Multiple
Data stream) machine [Fly66] typically consists of N processors and N memory
modules, where each processor can follow an independent instruction stream.
As with SIMD architecture, there is a multiple data stream and an
interconnection network. CDC Flexible Processor (FP) systems are MIMD
architectures that have been built [CDC77a], [CDC77b]. PASM is a proposed
partitionable SIMD/MIMD multimicroprocessor system for image processing
and pattern recognition [SiS81]. For this application, the use of PASM in the
SIMD mode of operation will be considered.

Maximum  likelihood  classification  [SwD78],  often  used  in  remote  sensing, 
classifies  each  pixel  independently  of  all  others.  Using  either  the  SIMD  or 
MIMD  mode  of  parallelism,  the  image  can  be  subdivided  among  the  processors, 
each  processor  classifying  its  own  subimage.  Thus,  N  processors  would  be  able 
to  execute  maximum  likelihood  classification  approximately  N  times  faster 
than  one  processor  of  the  same  type.  However,  parallel  implementations  of 
contextual  classifiers  are,  in  general,  not  so  straightforward,  due  to  the  use  of 
neighborhood  information.  The  way  in  which  parallel  machines  such  as  the 
CDC  FP  system  and  PASM  perform  contextual  classifications  is  examined  in 
the  following  sections. 

Section  2.2  briefly  describes  contextual  classification  and  gives  a 
uniprocessor  algorithm  for  performing  it.  The  implementation  of  a  contextual 
classification  algorithm  on  an  FP  system  and  a  comparison  of  the  timings 
obtained  on  an  FP  system  simulator  to  those  obtained  on  a  PDP- 11/70  are 
discussed  in  Section  2.3.  In  Section  2.4,  the  way  in  which  PASM  can  be 
applied  to  contextual  classification  is  considered. 


2.2.  Contextual  Classification 

2.2.1.  Definitions 

The image data to be classified are assumed to be a two-dimensional
I-by-J array of multivariate pixels. Associated with the pixel at “row i” and
“column j” is the multivariate measurement n-vector $X_{ij} \in R^n$ and the true
class of the pixel $\theta_{ij} \in \Omega = \{\omega_1, \ldots, \omega_C\}$. The measurement vectors have
class-conditional densities $f(X \mid \omega_k)$, k = 1, 2, ..., C, and are assumed to be
class-conditionally independent. The objective is to classify the pixels in the
array.

In  order  to  incorporate  contextual  information  into  the  classification 
process,  when  each  pixel  is  to  be  classified,  p-1  of  its  neighbors  are  also 
examined.  This  neighborhood,  including  the  pixel  to  be  classified,  will  be 
referred  to  as  the  p-array.  To  classify  each  pixel,  the  contextual  classifier 
computes  the  probability  of  the  given  observed  pixel  being  in  class  k  by  also 
considering  the  measurement  vectors  (values)  observed  for  the  neighbor  pixels 
in the p-array. Specifically, for each pixel, for each class in $\Omega$, a discriminant
function  g  is  calculated.  The  pixel  is  assigned  to  the  class  for  which  g  is  the 
greatest.  Each  value  of  g  is  computed  as  a  weighted  sum  of  the  product  of 
probabilities  based  on  the  pixels  in  the  neighborhood.  This  is  described  below 
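mathematically for pixel (i,j) being in class $\omega_k$. (The description is followed by
an example to clarify the notation used.)

$$
g_k(X_{ij}) = \sum_{\substack{\theta_{ij} \in \Omega^p \\ \text{class of pixel } (i,j) \,=\, \omega_k}}
\left[ \prod_{\gamma=1}^{p} f(X_\gamma \mid \theta_\gamma) \right] G^p(\theta_{ij})
$$

where

$X_\gamma \in X_{ij}$ is the measurement vector from the $\gamma$th pixel in the p-array (for
pixel (i,j))

$\theta_\gamma \in \theta_{ij}$ is the class of the $\gamma$th pixel in the p-array (for pixel (i,j))

$f(X_\gamma \mid \theta_\gamma)$ is the class-conditional density of $X_\gamma$ given that the $\gamma$th pixel is
from class $\theta_\gamma$

$G^p(\theta_{ij}) = G(\theta_1, \theta_2, \ldots, \theta_p)$ is the a priori probability of observing the p-array
$\theta_1, \theta_2, \ldots, \theta_p$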

Within the p-array, the pixel locations may be numbered in any
convenient, but fixed order. The joint probability distribution $G^p$ is referred to
as the context distribution. The class-conditional density of pixel
measurement vector X given that the pixel is from class k is:

$$
f(X \mid \omega_k) = e^{-\frac{1}{2}\left[\log|\Sigma_k| + (X - m_k)^T \Sigma_k^{-1} (X - m_k)\right]}
$$

where the measurement vector for each pixel is of size four, $\Sigma_k^{-1}$ is the inverse
of the covariance matrix for class k (four-by-four matrix), $m_k$ is the mean
vector for class k (size four vector), “T” indicates the transpose, “log” is the
natural logarithm, and $|\Sigma_k|$ is the determinant of the covariance matrix. This
is the same function as used for the maximum likelihood classification [SwD78].
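
The class-conditional density above can be evaluated with a few lines of code.
The sketch below assumes that the inverse covariance matrix and the logarithm
of the determinant have been computed beforehand for each class; the function
name class_conditional_density and its argument names are hypothetical.

```python
import math


def class_conditional_density(x, mean, inv_cov, log_det):
    """exp(-0.5 * (log|S_k| + (X - m_k)^T S_k^{-1} (X - m_k))).

    x, mean:  sequences of equal length (four in the text)
    inv_cov:  inverse covariance matrix as a list of rows (assumed precomputed)
    log_det:  natural log of the covariance determinant (assumed precomputed)
    """
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    quad = sum(d[r] * inv_cov[r][c] * d[c] for r in range(n) for c in range(n))
    return math.exp(-0.5 * (log_det + quad))


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
at_mean = class_conditional_density([1, 2, 3, 4], [1, 2, 3, 4], identity, 0.0)

# Covariance diag(4, 1, 1, 1): inverse is diag(0.25, 1, 1, 1), determinant 4.
inv_diag = [[0.25, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
off_mean = class_conditional_density([2, 0, 0, 0], [0, 0, 0, 0],
                                     inv_diag, math.log(4.0))
```

Precomputing the inverse and the log-determinant once per class leaves only
the quadratic form to be evaluated for each pixel and class.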

Consider, as an example, the horizontally linear neighborhood shown in
Fig. 2.2.1.1(a), where pixel (i,j) is the middle pixel, and assume there are two
possible classes: $\Omega = \{a, b\}$. Then the discriminant function for class b is
explicitly:

$$
\begin{aligned}
g_b(X_{ij}) ={} & f(X_1 \mid a)\, f(X_2 \mid b)\, f(X_3 \mid a)\, G(a,b,a) \\
& + f(X_1 \mid a)\, f(X_2 \mid b)\, f(X_3 \mid b)\, G(a,b,b) \\
& + f(X_1 \mid b)\, f(X_2 \mid b)\, f(X_3 \mid a)\, G(b,b,a) \\
& + f(X_1 \mid b)\, f(X_2 \mid b)\, f(X_3 \mid b)\, G(b,b,b)
\end{aligned}
$$

After computing the discriminant functions $g_a$ and $g_b$ for pixel (i,j), pixel
(i,j) is assigned to the class which has the larger discriminant value. (Edge
pixels of the image not having the appropriate p-1 neighbors are not
classified.)
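
The two-class example above can be checked numerically. The sketch below is
illustrative only: the density values and context probabilities are made-up
numbers, not data from the text, and the function name discriminant is
hypothetical.

```python
from itertools import product


def discriminant(k, densities, G, classes, center):
    """Compute g_k for one p-array.

    densities[pos][c] is f(X_pos | c); G maps a tuple of p classes to its
    context probability (missing tuples count as zero); `center` is the
    position of pixel (i,j) within the p-array.
    """
    p = len(densities)
    total = 0.0
    for theta in product(classes, repeat=p):
        if theta[center] != k:
            continue                      # class of pixel (i,j) is fixed at k
        term = G.get(theta, 0.0)
        for pos, c in enumerate(theta):
            term *= densities[pos][c]
        total += term
    return total


# Hypothetical numbers for a size-three horizontal neighborhood.
densities = [{"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}]
G = {("a", "a", "a"): 0.3, ("a", "b", "a"): 0.1,
     ("b", "b", "a"): 0.2, ("b", "b", "b"): 0.4}

g_a = discriminant("a", densities, G, ("a", "b"), center=1)
g_b = discriminant("b", densities, G, ("a", "b"), center=1)
assigned = "a" if g_a > g_b else "b"
```

With these numbers g_b = 0.009 + 0.072 + 0.016 = 0.097 (the G(a,b,b) term is
zero) and g_a = 0.027, so the pixel is assigned to class b.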

Consider the case where there is a non-linear three-by-three context array
(neighborhood), as shown in Fig. 2.2.1.1(b). Here, for each g, with C classes,
there are $C^8$ product terms with nine factors in each term. In general, for each
g, there are $C^{p-1}$ product terms, each term having p+1 factors. In the
LANDSAT data used in the testing described in [TiS81], the percentage of
non-zero $G^p$'s was about 1% (based on a size nine neighborhood and 14 classes),
so to conserve space and to increase throughput, only non-zero $G^p$'s are stored.
This technique will be discussed in later sections. All of the calculations are
done using floating point data.


2.2.2.  Uniprocessor  Algorithm 


The algorithm shown in Fig. 2.2.2.1 is a uniprocessor implementation of
the size three contextual classifier. $f(X \mid \theta_\gamma)$ is independent of the position
within a window, and thus does not change when a window is moved. This
algorithm is consistent with the theory presented above; however, to minimize
execution time, an array (called “hold”) is used to store “compf” values. Since
$f(X \mid \theta_\gamma)$ is required for all windows that contain pixel X, redundant
calculations may be eliminated by storing $f(X \mid \theta_\gamma)$ in a temporary array. The
stored $f(X \mid \theta_\gamma)$ is discarded when pixel X will no longer appear in any windows.
For the uniprocessor implementation, the temporary array is called “hold.”

Let “hold(m,k)” be a two-dimensional array of size three-by-C, i.e.,
0 ≤ m ≤ 2 and 1 ≤ k ≤ C. “hold(cr,k)” (statement S5) is a vector of length C
containing the class-conditional density values (“compf” values, statement S3)
for the pixel (i,j) (“cr” is an abbreviation for center). “hold(lt,k)” (statement
S4) and “hold(rt,k)” (statement S6) are the analogous vectors for the pixel
(i,j-1) (the left neighbor) and pixel (i,j+1) (the right neighbor), respectively.
By using this array to save the class-conditional densities, each density (for a
given pixel and class) is calculated only once.

The algorithm calculates the class-conditional densities for the first three
columns each time a new row is to be classified and stores them in “hold”
(statement S3). Each time a new pixel in a given row is to be classified
(statement S7), the pointers to these values in “hold” are updated (statement
S17). In particular, the data in “lt” is disposed of, “lt” is updated to point to
the data previously pointed to by “cr”, “cr” points to the data previously
pointed to by “rt”, and “rt” points to the newly calculated data (statement
S17) for the incoming pixel.

j 


Main Loop

for i = 0 to I-1 do                                    /* row index */
    for k = 1 to C do                                  /* for each class */
        for m = 0 to 2 do hold(m,k) = compf(i,m,k)     /* cols. 0-2 */
    lt = 0                          /* hold(lt,k) is left neighbor */
    cr = 1                          /* hold(cr,k) is pixel being classified */
    rt = 2                          /* hold(rt,k) is right neighbor */
    for j = 1 to J-2 do                                /* column index */
        value = -1; class = -1                         /* max “g” and class */
        for k = 1 to C do                              /* for each class */
            current = g(lt,cr,rt,k)
            if current > value                         /* compare with max */
            then value = current; class = k
        print pixel (i,j) is classified as “class”
        if j ≠ J-2 then                                /* update hold pointers */
            tp = lt; lt = cr; cr = rt; rt = tp
            for k = 1 to C do                          /* compf’s for next col */
                hold(rt,k) = compf(i,j+2,k)


Discriminant Function Calculation

function g(lt,cr,rt,k)              /* for pixel cr, class k */
    sum = 0                         /* initialize sum, used to accumulate g */
    for r = 1 to C do               /* all classes for pixel (i,j-1) */
        for q = 1 to C do           /* all classes for pixel (i,j+1) */
            if G(r,k,q) ≠ 0         /* do not multiply if G = 0 */
            then sum = hold(lt,r) * hold(cr,k)
                       * hold(rt,q) * G(r,k,q) + sum
    return (sum)                    /* sum contains value of g(lt,cr,rt,k) */

Class-Conditional Density Calculation

function compf(a,b,k)               /* for pixel (a,b), class k */
    x = A(a,b)                      /* x is the pixel (a,b) measurement vector */
    expo = [log|Σk| + (x-mk)^T Σk^-1 (x-mk)]/2
    return (e^-expo)                /* return value of f(A(a,b)|k) */

Fig. 2.2.2.1 Uniprocessor implementation of size
three contextual classifier algorithm (p=3)
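
The rotating “hold” buffer of Fig. 2.2.2.1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the thesis implementation: the density is a one-dimensional Gaussian stand-in for “compf”, classes are numbered from 0, the non-zero context weights are kept in a dictionary `G`, and the names `classify_row`, `classes`, and `G` are hypothetical.

```python
import math

def compf(x, mean, var):
    """Stand-in class-conditional density: a 1-D Gaussian (assumption; the
    thesis evaluates a multivariate density on pixel measurement vectors)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def classify_row(row, classes, G):
    """Classify columns 1 .. J-2 of one row with a size three neighborhood.

    row     -- list of scalar pixel measurements (length J >= 3)
    classes -- list of (mean, var) pairs, one per class
    G       -- dict mapping (r, k, q) to a non-zero context weight
    """
    C, J = len(classes), len(row)
    # hold[m][k] caches the density of class k for three adjacent columns.
    hold = [[compf(row[m], *classes[k]) for k in range(C)] for m in range(3)]
    lt, cr, rt = 0, 1, 2
    labels = []
    for j in range(1, J - 1):
        # Accumulate g for every class, visiting only the non-zero G entries.
        g = [0.0] * C
        for (r, k, q), weight in G.items():
            g[k] += hold[lt][r] * hold[cr][k] * hold[rt][q] * weight
        labels.append(max(range(C), key=lambda k: g[k]))
        if j != J - 2:
            # Rotate the pointers; the old left slot receives column j+2.
            lt, cr, rt = cr, rt, lt
            hold[rt] = [compf(row[j + 2], *classes[k]) for k in range(C)]
    return labels
```

Each density is computed once per pixel and class, and only the three most recent columns are kept, which mirrors the pointer update in the figure.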


The complexity of the algorithm is proportional to I*J*C^3 assignments,
multiplications, and additions, and I*J*C “compf” calculations. Typically,
10 ≤ C ≤ 60 for the analysis of LANDSAT data.

The algorithm can be extended for a non-linear contextual classifier with a
neighborhood of size nine (as shown in Fig. 2.2.1.1(b)). The complexity of the
algorithm would have growth proportional to I*J*C^9 assignments,
multiplications, and additions. The number of “compf” calculations would still
be I*J*C. In this case, “hold” would be a (2*J+3)-by-C array (assuming the
neighborhood window moves along rows). Fig. 2.2.2.2 shows the pixels whose
“compf” values are stored in the “hold” array. The 2*J+3 pixels whose
“compf” values are stored in “hold” are chosen to make it unnecessary to
perform redundant “compf” calculations. In general, when classifying pixel (i,j),
“hold” has the “compf” values for pixels j-1 to J-1 of row i-1, pixels 0 to J-1
(all) of row i, and pixels 0 to j+1 of row i+1. After the classification of pixel
(i,j), the values for (i+1,j+2) are added and the values for (i-1,j-1) are
removed. When the pixels on a new row, call it i', are to be classified, the
values for pixels (i'-2,J-3), (i'-2,J-2), and (i'-2,J-1) are removed and the
values for (i'+1,0), (i'+1,1), and (i'+1,2) are added. (This assumes row i' is
classified after i'-1.) Given this, the rest of transforming the algorithm for the
size nine square neighborhood case is straightforward.
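
As a quick check of the 2*J+3 figure, the set of pixels held while pixel (i,j) is classified can be enumerated directly. The function name `held_pixels` is hypothetical, and the sketch assumes 0-based columns with the window moving along rows, as in the text.

```python
def held_pixels(i, j, J):
    """Pixels whose "compf" values are in "hold" while (i, j) is classified
    with a three-by-three neighborhood."""
    above = [(i - 1, c) for c in range(j - 1, J)]   # columns j-1 .. J-1 of row i-1
    same = [(i, c) for c in range(0, J)]            # all of row i
    below = [(i + 1, c) for c in range(0, j + 2)]   # columns 0 .. j+1 of row i+1
    return above + same + below

def neighborhood(i, j):
    """The nine pixels of the three-by-three window centered on (i, j)."""
    return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
```

For any interior column the list has (J-j+1) + J + (j+2) = 2*J+3 entries, and it always contains the full window around (i,j).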

In  summary,  the  uniprocessor  one-by-three  algorithm  was  presented.  The 
extension  to  the  three-by-three  case  was  discussed.  Extension  to  other  size  and 
shape  neighborhoods  is  similar.  The  next  two  sections  discuss  parallel 
implementations  using  FPs  and  PASM  respectively. 


2.3.  MIMD  Implementation  on  the  CDC  Flexible  Processor  System 
2.3.1.  Flexible  Processor  System 

The  Control  Data  Corporation  Flexible  Processor  (FP)  system  is  a 
multiprocessor  system  which  has  been  recommended  for  use  in  remote  sensing. 
The basic components of an FP are shown in Fig. 2.3.1.1. There can be up to
16 FPs linked together, providing much parallelism at the processor level. The
FPs can communicate among themselves through a high-speed ring or shared
bulk memory. A possible FP system configuration is presented in Fig. 2.3.1.2.

The  instruction  cycle  time  of  each  FP  is  125  nsecs.  An  FP  is  programmed 
in  micro-assembly  language,  allowing  parallelism  at  the  instruction  level.  For 
example,  it  is  possible  to  conditionally  increment  an  index  register,  execute  a 
program  jump,  multiply  two  8-bit  integers,  and  add  two  32-bit  integers  --  all 
simultaneously.  This  type  of  operational  overlap,  in  conjunction  with  the 
capability  to  use  up  to  16  FPs  in  parallel,  greatly  increases  the  speed  of  the  FP 
system. 

The  following  list  summarizes  the  important  architectural  features  of  an 
FP: 

User  microprogrammable. 

Dual  16-bit  internal  bus  system. 

Able  to  operate  with  either  16-  or  32-bit  words. 

125 nsec. instruction cycle time.

125 nsec. time to add two 32-bit integers.

250 nsec. time to multiply two 8-bit integers.

Register files of over 8000 16-bit words.

60 nsec. read/write time for register files.

Up to 16 banks of 250 nsec. bulk memory (each bank holds 64K words).

In  order  to  debug,  verify  and  time  FP  algorithms,  a  simulator  and  an 
assembler  were  developed  for  a  system  of  up  to  16  FPs.  The  experience  gained 
through  the  use  of  the  simulator  has  made  evident  the  following  advantages 
and  disadvantages  of  the  FP  system. 

Advantages: 

Multiple  processors  (up  to  16) 

User  microprogrammable  --  parallelism  at  the 
instruction  level 

Connection  ring  for  inter-FP  communications 

Shared  bulk  memory  units 

Separate  arithmetic  logic  unit  and  hardware  multiply 

Disadvantages: 

No  floating  point  hardware 

Micro-assembly  language  --  difficult  to  program 

Program  memory  limited  to  4K  microinstructions 

Both  the  simulator  and  the  assembler  are  designed  to  operate  under  the 
UNIX operating system. They are described in [SmS80]. More details about
the FP system can be found in [SmS80],[SwS80],[CDC77a],[CDC77b].


2.3.2.  Linear  Contextual  Classifiers 

Consider using an N (≤ 16) FP system to implement the contextual
classifier based on a horizontally linear neighborhood of size three (Fig.
2.2.1.1(a)). Divide the I-by-J image into subimages of I/N rows, each J pixels
long, as shown in Fig. 2.3.2.1. This method of dividing the image is called striping.
Assign each subimage to a different FP. The entire neighborhood of each pixel
is included in its subimage. No interaction between FPs is needed, i.e., each FP
can process its subimage independently. A perfect factor-of-N
speedup over a single FP occurs if I is a multiple of N. The degradation in
performance that arises when I is not a multiple of N is less than 1% for large
images [SwS80].
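
The effect of I not being a multiple of N can be illustrated with a small sketch. It assumes that every row costs the same time to classify (which ignores the normalization-dependent variation discussed below) and that the run time is set by the FP holding the largest stripe; the function names are hypothetical.

```python
def stripe_rows(I, N):
    """Split I image rows into N contiguous stripes whose sizes differ by at most one."""
    base, extra = divmod(I, N)
    bounds, start = [], 0
    for fp in range(N):
        size = base + (1 if fp < extra else 0)
        bounds.append((start, start + size))   # half-open row range [start, end)
        start += size
    return bounds

def striping_speedup(I, N):
    """Speedup over one FP when the largest stripe determines the run time."""
    largest = max(end - start for start, end in stripe_rows(I, N))
    return I / largest
```

With this model the speedup is exactly N when N divides I, and approaches N for large I otherwise, since the largest stripe has at most one extra row.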

An FP micro-assembly language version of the algorithm stated in Fig.
2.2.2.1 was written. Because each FP is microprogrammable, determining
program  correctness  and  analyzing  the  execution  time  are  done  through  the  use 
of  the  micro-assembler  and  simulator.  All  floating  point  operations  are  done  in 
software.  Mantissa  normalization  of  all  floating  point  operands  gives  rise  to  a 
variation  in  the  overall  execution  time  per  pixel.  This  variation  can  be  as  much 
as  10:1  [SmS80]. 

Each  pixel  measurement  vector  consisted  of  four  32-bit  floating  point 
representations  of  8-bit  integers;  the  input  data  were  converted  to  floating 
point  notation  prior  to  the  execution  of  the  classifier.  This  conversion  is  not 
included  in  either  the  FP  or  comparative  PDP-11  timings.  Covariance  matrices 
consisted  of  ten  32-bit  floating  point  numbers.  Further,  32-bit  floating  point 
numbers  were  used  to  represent  the  logarithms  of  the  determinants  of  the 
covariance matrices and the a priori probabilities. The pixel measurement
vectors, covariance matrices, logarithms of the determinants of the covariance
matrices, a priori probabilities, and a temporary variable array are all stored in
the “large file” (see Fig. 2.3.1.1). Thus, in this case, each FP has all the
information it needs for performing the classification on its subimage stored in
its register file and no “bulk memory” accesses are required.

If the number of non-zero a priori probabilities is small (less than 50%),
and the contextual information (configuration of classes) associated with each
Gp can be stored in the space of one floating point number (32 bits), then any
algorithm that stores all a priori probabilities will waste memory space. This is
the case in the LANDSAT data used for this experiment. Each Gp is stored as
two 32-bit quantities. The first 32-bit quantity contains information about the
class of each pixel within the p-array. For example, if G(3,3,2) is non-zero, the
word preceding it is a representation (catenation) of 3, 3, and 2. This allows
⌊32/p⌋ bits per class, i.e., up to 2^⌊32/p⌋ classes. (Thus, for the size three
neighborhood being considered, C can be as large as 1024.) The second 32 bits
is the value of the Gp itself. Only the non-zero Gps are stored, so only the
non-zero Gps affect the computation time.
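
A minimal sketch of such a catenated context word is given below. The field order (leftmost class in the most significant bits) and the function names are assumptions made for illustration; the text specifies only that the classes are catenated with ⌊32/p⌋ bits each.

```python
def pack_classes(classes, word_bits=32):
    """Catenate p class numbers into one word, floor(word_bits / p) bits each."""
    p = len(classes)
    width = word_bits // p
    word = 0
    for c in classes:
        if not 0 <= c < (1 << width):
            raise ValueError("class number does not fit in its field")
        word = (word << width) | c
    return word

def unpack_class(word, position, p, word_bits=32):
    """Recover the class at `position` (0 = leftmost) by shifting and masking."""
    width = word_bits // p
    shift = width * (p - 1 - position)
    return (word >> shift) & ((1 << width) - 1)
```

For p = 3 each field is 10 bits wide, so 1024 distinct class numbers fit, which agrees with the limit quoted above.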

For larger windows (larger p), it is possible that 2^⌊32/p⌋ will not be large
enough to include all possible classes. If this occurs, one or two additional
32-bit words can be used to store the class information about the p-array. In such
cases, the non-zero Gps would have to be less than 30% or 25% respectively in
order for this scheme not to require additional space. As stated previously,
based on an analysis performed, the percentage of non-zero Gps is much smaller
than this.


When this memory arrangement is employed, the needed class information
is obtained by masking off the desired bits and shifting the result right
(producing a number between 0 and 2^⌊32k/p⌋ - 1, where k is a number between 1
and 3 depending on the number of words used to store the class information).
If the desired information does not cross a word boundary, this operation will
require 3p steps per non-zero Gp (load, logical and, shift); otherwise it will
require 7p steps per non-zero Gp (load, logical and, shift, load, logical and, add,
shift). Consider, instead, using the straightforward approach of storing all Gps,
both zero and non-zero. For a window of size p, a p-element vector (containing
elements between 0 and C-1) is required in order to create the C^p possible
window configurations. Incrementing an index value requires four operations:
storing the address of the index in the large file address register,
reading the index from the large file, incrementing the index, and storing
the new value in the large file. This is done each time an index is incremented.
In addition, each time an index is incremented, it must be compared to C.
If it equals C, it should be set to 0 and the next index incremented. 2p
operations are required (store address of index in large file address register and
store initial value of index) to initialize the indices. Thus, the time required to
handle the indices for this scheme is

    2p + 5 Σ_{i=1}^{p} C^i

steps per Gp (zero or non-zero). Thus, the proposed algorithm will not only be
more space efficient, but it will also run faster.
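
The index handling of the store-everything approach behaves like an odometer with p digits in base C. The sketch below counts how often an index is incremented over one full enumeration; it assumes each index wraps to 0 on reaching C and carries into the next index, as described above, and the function name is hypothetical. The count comes out as C + C^2 + ... + C^p increments in total, each of which costs the five steps (four to increment, one to compare) named in the text.

```python
def count_index_increments(p, C):
    """Enumerate all C**p window configurations with p indices (odometer style)
    and count how many times any index is incremented."""
    index = [0] * p
    increments = 0
    configurations = 0
    done = False
    while not done:
        configurations += 1            # one Gp (zero or non-zero) is visited
        pos = p - 1
        while True:
            index[pos] += 1
            increments += 1
            if index[pos] < C:
                break
            index[pos] = 0             # index reached C: reset and carry
            pos -= 1
            if pos < 0:
                done = True            # the leftmost index wrapped: finished
                break
    return configurations, increments
```

By contrast, the packed representation does no index bookkeeping at all for the zero-valued Gps, because they are never stored.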

For the purposes of testing the FP implementation of the one-by-three
linear contextual classifier program, measurement vectors from 30 rows of 16
pixels were classified. The data set consisted of a four-class subset of the
LANDSAT data used in [SwV81]. To provide a basis for comparison, a similar
contextual classifier was run on a PDP-11/70 over the same test data. It was
found that lack of exponent range in the 11/70 floating point hardware
required extra handling. FP floating point algorithms are implemented in
software, so a 14-bit exponent was used to overcome this problem. A
description of the floating point software is available in [SmS80]. The FP “e”
calculations are based on those in [Har68]. Twenty non-zero Gps were chosen
for the benchmark tests. Running under the above constraints, the single FP
classifier took 0.035 secs./pixel, while the PDP-11/70 required 0.050 secs./pixel, a
30% improvement.

Using 0.05 secs. per pixel as the PDP processing time and 0.035 secs. per
pixel as the single FP processing time, a 16 FP configuration would perform
contextual classifications at a rate of 457 pixels per sec., as opposed to 20
pixels per sec. for a single PDP-11/70. There are, of course, cost differences
between these two systems; however, the purpose here is to show the gains
made possible by a multiprocessor FP system. In general,
horizontally linear (Fig. 2.3.2.2(a)), vertically linear (Fig. 2.3.2.2(b)), and
diagonally linear (Fig. 2.3.2.2(c)) neighborhoods of various sizes can be
processed in a manner similar to that for the horizontally linear neighborhood
of size three [SwS80].

2.3.3.  Non-linear  Contextual  Classifiers 

Consider  non-linear  neighborhoods,  that  is,  neighborhoods  which  do  not  fit 
into one of the linear classes. For example, all of the neighborhoods in Fig.
2.3.3.1 are non-linear. It can be shown that there is no way to partition an
image into N (not necessarily equal) sections such that a contextual classifier
using a non-linear neighborhood can be performed without data transfers
among FPs [SwS80]. The specific non-linear case under consideration is the
three-by-three non-linear neighborhood, shown in Fig. 2.2.1.1(b). First, the
single FP timings are considered, then the timings for a system of N FPs are
considered.

The  eight-nearest  neighbor  contextual  classifier  is  similar  to  the  previously 
described  linear  case.  Differences  arise  in  the  calculation  of  the  discriminant 
function  (discussed  in  Section  2.2.1),  the  method  of  updating  the  “hold”  data 
for  a  given  window  (discussed  in  Section  2.2.2),  and  the  method  of  data  storage 
(discussed  below). 

Timings run on LANDSAT data from [SwV81] show that, on the
average, the FP implementation of the four-class, size nine square
neighborhood contextual classifier with all data entries and a priori information
stored in the large file requires 0.137 secs./pixel. A PDP-11/70 implementation
of the same algorithm requires 0.154 secs./pixel. Thus, there is an 11%
improvement. The improvement is smaller in this case than in the size three
horizontally linear case because the FP performs floating point operations in
software. The more terms in the product term, the more time the FP will
spend normalizing intermediate results. Tests for the 11/70 were run with 50
non-zero Gps and four spectral classes on 52 lines of 16 pixels. A
30-line-by-16-pixel subset of the above image was used to derive the FP timings
for a 52-line image. Pixels on the top and bottom line of an image are not
classified, and thus do not appear in the number of classified pixels. As a result,
for the first and last rows of an image, the classifier must calculate the
class-conditional probabilities for these pixels without ever classifying them.
Therefore, the results are slightly biased in favor of the 11/70 implementation.
Once again, only the non-zero Gps are stored, so only the non-zero Gps affect
computation time.

Using 0.154 secs. per pixel as the PDP processing time and 0.137 secs. per
pixel as the single FP processing time, a 16 FP system would perform
contextual classifications at a rate of approximately 116 pixels per sec., as
opposed to the 6 pixels per sec. rate of a single PDP-11/70. This assumes,
however, that all needed data are stored in the large file, a somewhat
unrealistic assumption. The use of the bulk memories for storing and sharing
data is discussed in the next three sections.
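
The projected rates quoted in Sections 2.3.2 and 2.3.3 follow from the measured per-pixel times if ideal (linear) speedup over the FPs is assumed. The short sketch below reproduces that arithmetic; the function and dictionary names are hypothetical.

```python
def pixels_per_second(secs_per_pixel, processors=1):
    """Aggregate classification rate, assuming ideal linear speedup."""
    return processors / secs_per_pixel

# Per-pixel times are the measured values quoted in the text.
rates = {
    "size three, 16 FPs": pixels_per_second(0.035, 16),
    "size three, PDP-11/70": pixels_per_second(0.050),
    "size nine, 16 FPs": pixels_per_second(0.137, 16),
    "size nine, PDP-11/70": pixels_per_second(0.154),
}
```

Truncating the resulting values gives the 457, 20, 116, and 6 pixels per second cited above.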


2.3.4.  Processing  of  Images  with  Large  Numbers  of  Gps 

If there are too many a priori probabilities to fit in the register files, bulk
memory can be used to store the overflow Gps. The width of the bulk memory
is 16 bits. Each Gp is composed of either two, three, or four 32-bit quantities.
One contains the Gp itself, while the rest is the contextual information
associated with a given pixel (see Section 2.3.2). A 64-bit Gp can be accessed with four
reads from bulk memory, while a 96-bit Gp can be accessed with six reads,
and a 128-bit Gp can be accessed with eight reads. One of the special features
associated with an FP is that every time a read from bulk memory is
performed, the pointer to bulk memory is automatically incremented [CDC77a].
A read from bulk memory is accomplished in two steps [CDC77a],[CDC77b].
First the read must be initialized and second (after 0.250 μsecs.) the data must
be read from the bulk memory [CDC77a],[CDC77b]. On the surface, it would
appear that a 16-bit read requires four clock cycles; however, this is not the
case. The read can be initialized in parallel with other operations; thus no time
is lost due to the initialization. An FP can wait for the data or it can execute
other instructions in the meantime. Thus, the total cost of a read from bulk
memory is one instruction cycle per 16 bits. The cost, then, of accessing a Gp
and its corresponding context configuration from the bulk memory is 2 + 2k
instruction cycles, or 0.250 μsecs. + k × 0.250 μsecs., where k is the number of
words used to store the context information. To perform the corresponding
operation from the large file requires 0.250 μsecs., or two instruction cycles.

As an example, use the benchmark eight-nearest neighbor non-linear
context array, where k = 1. Allow all 50 of the Gps to be stored in bulk
memory. The total time spent accessing the Gps is:

    (0.500 μsecs. / Gp) × (50 non-zero Gps / pixel) = 25 μsecs. / pixel

Only half of this time, however, represents additional processing time over
fetching the Gp and its corresponding context array from the large file. Thus,
the additional processing time required to process the Gps stored in bulk memory
is 12.5 μsecs. per pixel. When this is compared to the 137,000 μsecs./pixel
required for classification, this time represents a negligible cost. In the cases
where there are more classes, this cost will be even less significant.
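
The access-cost arithmetic can be written out as a short sketch. It takes the 125 nsec. cycle time and the 2 + 2k cycle count from the text; the large-file cost of one cycle per 32-bit word and the function names are assumptions for illustration.

```python
CYCLE_USEC = 0.125  # FP instruction cycle time (125 nsec.)

def gp_bulk_cycles(k):
    """Cycles to fetch one Gp plus k 32-bit context words from 16-bit bulk memory."""
    return 2 + 2 * k

def gp_large_file_cycles(k):
    """Cycles for the same fetch from the large file (assumed one cycle per 32-bit word)."""
    return 1 + k

def gp_usec_per_pixel(nonzero_gps, cycles_per_gp):
    """Time per pixel spent fetching the non-zero Gps."""
    return nonzero_gps * cycles_per_gp * CYCLE_USEC

bulk = gp_usec_per_pixel(50, gp_bulk_cycles(1))            # all 50 Gps in bulk memory
extra = bulk - gp_usec_per_pixel(50, gp_large_file_cycles(1))
```

For the benchmark case (50 non-zero Gps, k = 1) this gives 25 μsecs. per pixel in total and 12.5 μsecs. per pixel of additional time, a tiny fraction of the 137,000 μsecs. per pixel classification time.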

2.3.5.  Processing  of  Images  in  Bulk  Memory 

If an image is small, data vectors may be stored in the large file. This was
the method used to acquire the timings presented. For actual images, however,
the large file is too small to hold the image data. Pixel measurement vectors
can be stored in bulk memory. There is, however, an additional cost associated
with reading pixel measurement vectors from bulk memory. Pixel data is
represented as a one-by-four vector of 32-bit floating point numbers. It was
earlier stated that a 16-bit read from bulk memory requires the same amount
of time as a 32-bit read from the large file. Thus, reading a 32-bit number from
bulk memory will require twice as much time as a corresponding read from the
large file. Reading a data vector from the large file will require four instruction
cycles, or 0.5 μsecs./pixel. Reading the same data from bulk memory will
require an additional processing time of four instruction cycles, or
0.5 μsecs./pixel. This is minimal when compared with the 137,000 μsecs./pixel
processing time associated with the eight nearest-neighbor contextual classifier.


2.3.6.  A  16  FP  System 

Consider the problem of using N (≤ 16) FPs together to do contextual
classification with a square size nine neighborhood. Assume the image data is
stored in the bulk memories. The approach taken is to divide the image among
the FPs using the “striping” method (Fig. 2.3.2.1). Each FP classifies the
pixels in its own subimage. Because the p-array is non-linear, FPs will have to
communicate to share subimage edge data [SwS80]. For example, to classify
the bottom row of FP 0’s subimage, information about the pixels in the top
row of FP 1’s subimage is needed (i.e., the neighborhood window crosses
subimage boundaries). Thus, some way to achieve this sharing is necessary.

The speed at which the contextual classifier runs depends on the floating
point algorithms, which are implemented in software. This can cause a
bottleneck in the processing if one FP is required to wait for another.
Synchronization can require large amounts of time if the full 16 processor array
is used, since at each step, the slowest FP will determine the execution time.
Thus, asynchronous processing at the instruction level is necessary.

An FP is capable of addressing up to three channels of 16-by-128K bytes
of bulk memory each [CDC77a],[CDC77b]. The sharing of bulk memory is a
scheme that can be used for transferring data among FPs. One possible
implementation is shown in Fig. 2.3.6.1. Bus 0 of FP i will be shared with
FP i-1, while bus 1 will be local to FP i, and bus 2 will be shared with
FP i+1. An FP will be allowed to address only half of its L memory banks at
one time. This is done to facilitate double buffering. The other L/2 memory
banks will be accessible by the host. This allows the FP to classify one image
while the host unloads and stores the results of the previous classification and
then loads the next image to be processed.

Assume each FP will classify the pixels in I/N rows (Fig. 2.3.2.1). If
border  areas  are  stored  in  the  shared  memory  banks,  a  processor  will  begin 
processing  in  banks  of  bus  1.  Processing  will  continue  through  half  of  the  L/2 
banks  in  bus  1  to  bank  0  on  bus  2.  After  all  the  data  in  the  banks  on  data  bus 
2  have  been  processed,  processing  will  continue  to  the  banks  on  bus  3. 

Allowing 25% of FP i’s data to be stored in the shared banks on bus 1,
50%  of  the  data  to  be  stored  in  the  local  banks  on  bus  2,  and  25%  of  the  data 
to  be  stored  in  the  shared  banks  on  bus  3,  no  contention  will  occur.  Consider 
that  for  processor  i  to  “catch  up”  with  processor  i  +  1,  processor  i  will  have  to 
process  more  than  75%  of  its  data  in  the  time  that  it  takes  processor  i  +  1  to 
process  25%  of  its  data.  Thus,  contention  is  not  a  problem. 

Fig. 2.3.6.1 A potential FP architecture for image processing

When an image is divided by the striping scheme, all non-linear windows
will require FPs to share data. In particular, for the case of an A-by-A window,
(A-1) rows of “compf”/pixel values must be commonly accessible by adjacent
FPs. This is shown in Fig. 2.3.6.2. Assuming that an FP classifies all pixels
in its subimage, that the pixel to be classified is in the middle of the window,
and that A is odd, FP i (i>0) will require the (A-1)/2 bottom rows of data
from the subimage of FP i-1 to classify the top row of its subimage (in
addition to the (A-1)/2 rows of data from its own subimage). In addition, FP i
will require the (A-1)/2 top rows of data from the subimage of FP i+1 to
classify the last row in its subimage. Once the “compf” values for a given
pixel are calculated, they do not change. Thus, if FP i calculates the “compf”
values for the (A-1)/2 bottom rows of pixels from the subimage that “belongs”
to FP i-1 and stores those “compf” values and the “compf” values for the top
(A-1)/2 rows of its subimage in shared bulk memory, FP i-1 will not need to
recalculate the “compf” values for those pixels. While FP i is calculating the
“compf” values for the bottom (A-1)/2 rows of data from the subimage of FP
i-1, FP i+1 is calculating the “compf” values for the (A-1)/2 bottom rows of
data from the subimage of FP i. When FP i classifies the bottom (A-1)/2 rows
of its subimage, the needed “compf” values will have already been calculated
by FP i+1. Thus, to classify the bottom (A-1)/2 rows of data from a given
subimage, FPs will not need to calculate any “compf” values, as they are
already stored in either the hold array or in the shared bulk memory. There is
little possibility that one processor will require data before it is ready. For a
processor to require such data, it would have to process (I/N)-((A-1)/2) rows
of its data in the same time that another processor would have had to classify
less than (A-1)/2 rows of its data.
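
The rows that each FP must be able to see under striping can be listed with a short sketch. It assumes that I is a multiple of N, that rows and FPs are numbered from 0, and that A is odd; the function name `rows_needed` is hypothetical.

```python
def rows_needed(fp, N, I, A):
    """Rows whose "compf" values FP `fp` needs for an A-by-A window (A odd),
    with the I-row image striped into N blocks of I // N rows."""
    rows_per_fp = I // N
    halo = (A - 1) // 2                      # rows borrowed from each neighbor
    first = fp * rows_per_fp
    last = first + rows_per_fp - 1
    from_above = list(range(max(first - halo, 0), first))          # owned by FP fp-1
    own = list(range(first, last + 1))
    from_below = list(range(last + 1, min(last + halo, I - 1) + 1))  # owned by FP fp+1
    return from_above, own, from_below
```

For a three-by-three window each interior FP needs one extra row on each side, so A-1 = 2 rows straddling each stripe boundary must be reachable by both neighbors, which is the shared region described above.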


2.3.7.  Processing  of  Large  Images 

Assume  that  an  FP  system  is  configured  as  previously  described.  If  the 
image  to  be  processed  will  fit  into  bulk  memory,  the  image  can  be  processed 
according to the “striping scheme” discussed earlier. There is, however, another
problem  that  can  arise.  An  image  may  be  too  large  to  fit  in  the  bulk  memory. 

Assume that there are L' bulk memory banks per FP for data, separate
from the bulk memory banks for the Gps, that there are N FPs, and that a
three-by-three neighborhood is being classified. If an image will not fit into the
N*L'/2 bulk memory banks, the host will transmit only the leftmost
unprocessed columns of the image that will fit into N*L'/2 bulk memory banks
at a time, L'/2 banks per FP. While the FP is processing one subimage in one
half of its memory, the host can be loading the next subimage into the other
half of the bulk memory. This will overlap the FP operation with the host’s
operation. If an image and its associated data can fit in N*L' memory banks,
it is still beneficial to use the striping scheme, as this will facilitate the
preloading of the next image to be processed. Fig. 2.3.7.1 is an example of how
an image is divided and processed. The FPs process subimages from left to
right. Each subimage will be processed as described in Section 2.3.6. The
stored class-conditional densities (“compf” values) for the rightmost two
columns of data must be saved, as they are needed to process the next
subimage. These columns of data will be stored in one of the L' memory
banks. This memory bank will not be accessed by the host, as it will contain
the “compf” values necessary for the FP to process the next subimage. The
exception to this rule is the last subimage. Since the FP will have no further
processing, it is not necessary to save these values. Neither the first nor the last
column an FP processes will be classified, as there is insufficient context
information.

Since the floating point operations require variable amounts of time, an
FP processing its portion of the image may finish before the rest of the
processors. With the FPs running asynchronously, it is theoretically possible
for a given FP eventually to get two subimages ahead of its neighboring FPs.
Subimage edge data would be destroyed for the neighboring FPs if the host
were to load new data into the shared memory banks before the neighboring
two FPs had finished with the old data. To prevent this from happening, after
an FP processes two subimages, it must wait for the other FPs to finish.

When  an  FP  finishes  writing  results  into  a  bank  of  bulk  memory,  it  signals 
the  host  to  read  all  necessary  data  from  that  memory  bank,  even  though  an 
adjacent  FP  will  need  to  read  data  corresponding  to  the  subimage  edge  pixels 
from  that  bulk  memory  bank  to  process  the  next  subimage.  Since  a  read  is 
non-destructive,  the  host  reading  from  bulk  memory  will  not  hamper  an  FP 
reading  from  the  same  bulk  memory  bank.  All  FPs  accessing  a  given  bulk 
memory  bank  must  set  flags  in  bulk  memory  before  the  host  can  write  to  this 
bank. This will prevent the host from overwriting data that is still in use. As
was stated in Section 2.3.2, with 20 non-zero Gps, a single FP classifier took
0.035 s to classify a single pixel. Reading a pixel measurement vector from
bulk memory will require 4.0 μs. Most of the execution time is spent in
mathematical calculations, not fetching data, so any possible contention will
have a negligible effect on pixel processing time.
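
The relative size of the two timings quoted above can be checked with a short calculation. The sketch below is only illustrative: the constant and function names are hypothetical, and it compares a single bulk-memory read against the per-pixel classification time.

```python
# Timings quoted in the text; the names are illustrative, not from the thesis.
CLASSIFY_TIME_S = 0.035   # one FP classifying one pixel (20 non-zero Gps)
FETCH_TIME_S = 4.0e-6     # one measurement-vector read from bulk memory


def fetch_fraction(fetch_s: float, classify_s: float) -> float:
    """Fraction of the per-pixel time taken by a single bulk-memory read."""
    return fetch_s / classify_s


if __name__ == "__main__":
    frac = fetch_fraction(FETCH_TIME_S, CLASSIFY_TIME_S)
    print(f"one fetch is {frac:.4%} of the per-pixel classification time")
```

Under these figures a single fetch is on the order of one hundredth of one percent of the per-pixel time, which is why a delayed read has little effect on the total.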


2.3.8.  Summary 

In  summary,  the  organization  of  the  FP  system  given  above  will  allow 
contention-free  sharing  of  data.  This  means  that  N  FPs  will  be  able  to  operate 
approximately  N  times  faster  than  one  FP.  Furthermore,  the  double-buffering 
of  the  bulk  memories  will  allow  the  loading  of  images  to  be  processed  and 
storage  of  results  to  be  overlapped  with  the  classification  operation  of  the  FPs. 

2.4.  SIMD  Implementations  on  PASM 

2.4.1.  Introduction 

PASM  is  a  dynamically  reconfigurable  multimicrocomputer  system  whose 
design  will  support  as  many  as  1024  processors  [SiS81].  SIMD  implementations 
of  contextual  classifiers  based  on  PASM  are  discussed  in  the  next  section. 
First,  a  brief  overview  of  PASM  is  presented,  limited  to  those  aspects  of  PASM 
that  are  needed  to  understand  the  SIMD  algorithms  that  follow. 

2.4.2.  Overview  of  PASM 

Fig. 2.4.2.1 is a block diagram of PASM. The heart of the system is the
Parallel  Computation  Unit  (PCU),  which  contains  N  processors,  N  memory 
modules,  and  the  interconnection  network.  The  PCU  processors  are 
microprocessors  that  perform  the  actual  computations.  The  PCU  memory 
modules  are  used  by  the  PCU  processors  for  data  storage  in  SIMD  mode. 
When  a  PCU  processor  is  combined  with  a  PCU  memory  unit,  it  is  referred  to 
as a Processing Element (PE). The interconnection network provides a
means of communication among the PCU processors and memory modules.
PASM uses data conditional and PE address masks to activate and deactivate
PCU processors in SIMD mode.

Fig. 2.4.2.1 Block diagram of PASM [SiS81]

The  processors,  memory  modules,  and  interconnection  network  of  the  PCU 
are  organized  as  shown  in  Fig.  2.4.2.2.  A  pair  of  memory  units  is  used  for  each 
PCU  memory  module  so  that  data  can  be  moved  between  one  memory  unit 
and  the  secondary  storage,  while  the  PCU  processor  operates  on  data  in  the 
other  memory  unit.  Each  PCU  memory  unit  may  be  as  large  as  64K  16-bit 
words.  Two  choices  being  considered  for  the  network  are  the  Generalized  Cube 
[SiM81b] and Augmented Data Manipulator [SiM81a]. Their relative merits are
currently under study [McS82].

The  Micro  Controllers  (MCs)  are  a  set  of  microprocessors  which  act  as 
the  control  unit  for  the  PCU  processors  in  SIMD  mode.  Control  Storage 
contains  the  programs  for  the  MCs.  Each  MC  memory  module  consists  of  a 
pair  of  memory  units.  This  allows  programs  and/or  common  data  to  be  moved 
between  Control  Storage  and  one  MC  memory  unit,  while  the  MC  is  using  the 
other  memory  unit. 

The  Memory  Management  System  controls  the  loading  and  unloading 
of  the  PCU  memory  modules.  It  employs  a  set  of  cooperating  dedicated 
microprocessors.  The  Memory  Storage  System  is  the  secondary  storage  for 
these  files.  Multiple  devices  are  used  to  allow  parallel  data  transfers.  The 
System  Control  Unit  is  a  conventional  machine,  such  as  a  PDP-11,  and  is 
responsible  for  the  overall  coordination  of  the  activities  of  the  other 
components  of  PASM. 


The  approach  taken  to  contextual  classification  using  PASM  in  SIMD 
mode  is  different  from  that  for  the  FP  system,  since  the  processors  are 
synchronized  and  there  is  no  directly-wired  shared  memory.  There  are  three 
main  differences  between  the  FP  and  SIMD  implementations.  First,  it  is 
technologically  feasible  to  construct  a  multimicroprocessor  SIMD  machine  with 
many  more  than  16  processors.  Second,  there  are  differences  in  computational 
capabilities,  i.e.,  16  FPs  may  be  faster  than  32  microprocessors.  Third,  in  SIMD 
mode,  the  program  is  stored  in  the  control  unit  (MCs),  which  broadcasts  it  to 
the  PCU  microprocessors.  The  control  unit  also  stores  the  Gp  array,  decoding 
and  broadcasting  each  element  as  needed.  In  the  FP  system,  each  FP  stores  a 
copy  of  the  program  and  Gp  array. 

2.4.3.  Linear  Contextual  Classification  on  PASM 

Consider  using  PASM  to  implement  the  contextual  classifier  based  on  a 
horizontally  linear  neighborhood  of  size  three.  If  the  image  to  be  classified  is  a 
typical LANDSAT [NAS72] frame (I = 3250, J = 2340), 776 PEs will be assigned
7427 pixels and 248 PEs will be assigned 7426 pixels. Classification is
accomplished by having each of the PEs execute the serial algorithm of
Section 2.2.2 simultaneously. For example, all PEs first calculate the “compf”
values for their pixels. This is done simultaneously in all PEs, where the 248
PEs assigned 7426 pixels will be disabled for the last pixel's operations. All PEs
will then send their neighbor the “compf” values that need to be shared. By
extending the previously discussed striping scheme to include a non-integer
number of rows assigned to each PE, this task division is realizable. The
modified striping scheme, shown in Fig. 2.4.3.1, requires 2C additional network
transfers over the original striping scheme for sharing “compf” values between
adjacent PEs. This cost is negligible when compared to the classification time
of 7426 pixels. Each of the interconnection networks under consideration for
PASM can perform each of the 2C required data transfers in one pass through
the network, where each transfer involves N PEs, i.e., when PE i is transferring
data to PE i−1, PE i−1 is transferring data to PE i−2, etc. On PASM, a PE
will get an instruction to send another PE the shared data. This differs from
the FP system, where an FP gets the data it needs on its own. The
asynchronous nature of the FP system makes this modification to the striping
algorithm less efficient on the CDC system.

An  image  may  be  so  large  that  not  all  of  the  data  will  fit  into  the  PCU 
memory  space  allocated.  The  double-buffered  memory  modules  can  be  used  so 
that  as  soon  as  the  data  in  one  memory  unit  are  processed,  the  processor  can 
switch  to  the  other  unit  and  continue  executing  the  same  program.  When  the 
processor  is  ready  to  switch  memory  units,  it  signals  the  Memory  Management 
System  that  it  has  finished  using  the  data  in  the  memory  unit  to  which  it  is 
currently connected. The processor switches memory units, assuming that the
data are present, and then checks a data identification tag to ensure that the
new data are available. The Memory Management System can then unload the
“processed” memory unit and load it with the next subimage. For both the
one-by-three linear window and the three-by-three nonlinear window, this
scheme  will  require  some  mechanism  to  allow  the  “compf”  values  for  the  last 
two  columns  of  a  subimage  in  a  given  memory  bank  to  be  available  when  the 
associated  processor  switches  to  the  next  memory  unit. 

One method of doing this maintains a copy of local data in both memory
units associated with a given processor, so that switching memory units does
not alter the local variable storage associated with the processor [SiS81]. In
essence, this technique makes use of the conventional store-through technique,
as described in [Hay78]. This scheme would be used only when multiple
subimages are to be processed.

The  time  required  to  classify  a  LANDSAT  frame  is  the  same  as  the  time 
required  for  each  PE  to  classify  7427  pixels.  If  each  PE  were  to  classify  7427 
pixels,  7,605,248  pixels  would  be  classified,  representing  a  speedup  of  1024.  For 
a  3250-by-2340  image,  PASM  will  classify  7,605,000  pixels  in  the  same  time. 
This is 99.997% of the theoretical improvement of 1024.
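
The pixel assignment and the resulting speedup can be reproduced with integer arithmetic. The following is a minimal sketch, assuming only the frame size and PE count given in the text; the function names are hypothetical.

```python
def distribute_pixels(rows: int, cols: int, num_pes: int):
    """Split rows*cols pixels as evenly as possible over num_pes PEs.

    Returns (base, pes_with_extra, pes_with_base): pes_with_extra PEs
    receive base + 1 pixels and pes_with_base PEs receive base pixels.
    """
    base, extra = divmod(rows * cols, num_pes)
    return base, extra, num_pes - extra


def simd_speedup(rows: int, cols: int, num_pes: int) -> float:
    """Speedup over one PE when the run time is set by the busiest PE."""
    total = rows * cols
    busiest = -(-total // num_pes)  # ceiling division
    return total / busiest


if __name__ == "__main__":
    print(distribute_pixels(3250, 2340, 1024))
    print(simd_speedup(3250, 2340, 1024) / 1024)
```

For the 3250-by-2340 frame this gives 776 PEs with 7427 pixels and 248 PEs with 7426 pixels, and a speedup within a few thousandths of a percent of 1024.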

2.4.4.  Non-Linear  Contextual  Classification  on  PASM 

Consider  implementing  a  three-by-three  non-linear  contextual  classifier  on 
PASM. The I-by-J image is divided into N subimages. Each PE will be assigned
an (I/√N)-by-(J/√N) array as shown in Fig. 2.4.4.1. If I is not divisible by
√N, some PEs will have to process (I/√N) + 1 rows of data, while others will
have to process I/√N. Similarly, if J is not divisible by √N, some PEs will
have to process (J/√N) + 1 columns of data instead of J/√N. In all cases, the
PEs processing the smaller amount of data will be disabled while the remaining
PEs continue processing. All of the PEs will execute the algorithm discussed in
Section 2.2. Each PE can classify all the pixels in its subimage which are not
on the subimage edges. All PEs can do this simultaneously. To classify
subimage edge pixels, the PEs must share data by passing information through
the interconnection network. For example, in order for PE 0 to classify pixel
(0, (J/√N) − 1) it needs to get the “compf” values for pixel (0, J/√N) from PE 1.
Both networks under consideration can perform each of the nearest neighbor
inter-PE transfer operations in one pass through the network.

One  way  to  share  “compf”  values  among  PEs  is  to  have  each  PE  first 
compute  and  store  the  “compf”  values  for  its  edge  pixels  in  a  vector  called 
EDGE.  (Later,  when  a  PE  needs  the  “compf”  values  for  these  pixels  in  order 
to  classify  pixels  in  its  own  subimage,  they  are  read  from  EDGE,  not 
recomputed.)  Each  PE  sends  copies  of  these  values  to  the  appropriate 
“adjacent”  PE.  A  PE  saves  the  value  it  receives  in  a  vector  OUTEREDGE. 
Each  PE  accesses  its  own  OUTEREDGE  vector  when  it  is  ready  to  classify  its 
edge pixels. This method requires only ((2(I + J)/√N) + 4)C parallel data
transfers. For each of the required transfers, the networks being considered for
PASM will allow all PEs to perform the transfer simultaneously. A
checkerboard division of the image was used since, in general, it requires fewer
inter-PE transfers than dividing the image by rows or columns. For arithmetic
operations and “compf” calculations, a perfect factor of N speedup is attained.
This is done at the “cost” of ((2(I + J)/√N) + 4)C inter-PE transfers. These
data transfers are negligible when compared with the I*J*C/N “compf”
computations.
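
To give a feel for the two counts, the sketch below evaluates both expressions from the text for a given image and machine size. It is an illustration only: the function name is hypothetical, N is assumed to be a perfect square, and C is simply the factor C that appears in both expressions.

```python
import math


def checkerboard_costs(I: int, J: int, N: int, C: int):
    """Return (inter-PE transfers, per-PE compf computations).

    transfers = ((2(I + J) / sqrt(N)) + 4) * C
    compf     = I * J * C / N
    """
    root = math.isqrt(N)
    if root * root != N:
        raise ValueError("N is assumed to be a perfect square")
    transfers = (2 * (I + J) / root + 4) * C
    compf = I * J * C / N
    return transfers, compf


if __name__ == "__main__":
    transfers, compf = checkerboard_costs(3250, 2340, 1024, 1)
    print(transfers, compf, transfers / compf)
```

For the LANDSAT frame on 1024 PEs the transfer count is about 4.8% of the per-PE “compf” count, independent of C, and the ratio shrinks as the image grows.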

2.5.  Conclusions 

Based  on  simulated  results,  timings  for  contextual  classification  on  an  FP 
system  have  been  presented  and  discussed.  A  potential  system  configuration  for 
the  FP  system  has  been  presented,  and  its  use  discussed.  For  comparison, 
timings have been presented for contextual classification on a PDP-11/70. It
was found that a PDP-11/70 runs at a slightly slower speed than a single FP
on the contextual classification algorithms examined. Further, it was shown
that N FPs could execute contextual classification almost N times as fast as
one  FP.  Thus,  the  multiprocessor  parallelism  of  an  FP  system  can  be 
successfully  exploited. 

It  was  shown  that  N  processors  in  the  SIMD  mode  of  operation  could 
accomplish  contextual  classification  almost  N  times  faster  than  one  processor  of 
the  same  type.  In  particular,  an  SIMD  algorithm  for  PASM  to  perform  the 
computationally  intensive  task  of  contextual  classification  was  presented. 

The  FP  and  PASM  approaches  could  be  combined  [SmS82].  A 
multimicroprocessor  SIMD  machine  with  shared  memories  (as  in  the  FP 
approach)  and  no  interconnection  network  would  be  an  efficient  special-purpose 
system  for  performing  contextual  classification  with  various  size  and  shape 
neighborhoods. 

Thus, through the use of parallel computer systems, such as PASM and
CDC  FPs,  the  types  of  computations  required  for  contextual  classifiers  and 
other  computationally  demanding  remote  sensing  processes  can  be  implemented 
efficiently.  This  will  not  only  reduce  the  computation  time  required  to  do 
contextual  classification,  but  will  also  allow  the  investigation  of  techniques 
which  may  otherwise  be  considered  infeasible. 


CHAPTER  3 


PARALLEL  PROCESSING  CONCEPTS  FOR 
REMOTE  SENSING  APPLICATIONS 


3.1.  Introduction 

Multispectral  image  data  collected  by  remote  sensing  devices  aboard 
aircraft  and  spacecraft  are  relatively  complex  data  entities.  Because  of  the 
multispectral  nature  of  remote  sensing  image  data,  vectors  are  used  to 
represent  the  data.  The  execution  of  even  the  simplest  classification  algorithms 
may  require  large  amounts  of  computation  time.  Thus,  in  order  to  allow 
complex  classification  algorithms  to  become  more  feasible,  special  hardware 
(such  as  the  previously  discussed  parallel  architectures)  to  increase  the 
execution  speed  is  of  interest. 

For  many  remote  sensing  tasks,  all  pixels  in  a  given  image  are  treated  in  a 
similar  fashion.  This  implies  that  the  same  numerical  operations  are  done  on  all 
pixels.  Thus,  the  same  instructions  are  performed  on  multiple  data  sets.  It 
would  appear  that  SIMD  machines,  such  as  those  discussed  in  Chapter  1,  are 
particularly  well-suited  to  these  tasks.  Further,  since  images  as  large  as  3250- 
by-2340  pixels  [NAS72]  are  common,  a  system  that  has  as  many  as  1024 
processors  would  be  well-suited  for  image  processing  tasks.  Large  scale 
integration  makes  just  such  parallel  systems  possible. 


The application of such a machine to image processing tasks is the topic
under  consideration  here.  Section  3.2  introduces  a  potential  machine 
architecture.  Sections  3.3,  3.4,  3.5,  and  3.6  discuss  how  such  a  system  can  be 
applied  to  smoothing,  maximum  likelihood  classification,  contextual 
classification,  and  image  correlation,  respectively.  The  fault  tolerance  of 
MuRSS  is  discussed  in  Section  3.7.  Enhancements  to  the  MuRSS  architecture 
to increase fault tolerance are presented in Section 3.8, where the fault
tolerance of both the original and enhanced systems is compared. An
overview  of  MPP,  the  Massively  Parallel  Processor  (an  already  existing 
architecture)  is  presented  in  Section  3.9,  along  with  a  discussion  comparing 
MPP  to  the  enhanced  MuRSS  system  in  the  areas  of  performance,  capabilities, 
and  fault  tolerance. 

3.2.  Machine  Architecture 

The  proposed  SIMD  architecture,  Multimicroprocessor  Remote  Sensing 
System (MuRSS), is shown in Fig. 3.2.1. The system consists of N + 1
processing units (PUs) numbered from 0 to N and 2N + 2 memory modules
numbered from 0 to 2N + 1 (Fig. 3.2.2). During normal operation, N PUs
(numbered 0 to N−1) and 2N memory modules (numbered 0 to 2N−1) will be
used (Fig. 3.2.3). PU number N, memory module number 2N, and the
wraparound connection are for fault tolerance.

Each PU will be a commonly available microprocessor, such as a 68000
[Mot80] equipped with a floating point unit, and will be connected to four
busses in addition to its own private bus. The private bus will be connected to
the PU’s private memory, which will contain such things as local variables and
monitor routines. One of the remaining four busses will be used to
communicate with the control unit, while the other three busses, numbered 0, 1,
and 2 (Fig. 3.2.4), will be connected to banks of memory. Two of these busses
will be connected to “shared” memory banks. Thus, these busses, and
consequently the associated memory banks, will also be connected to adjacent
processors. This will allow data to be shared among adjacent PUs for window
based operations, like the contextual classifier discussed in Section 2.2. (Note
that the 2 bus of PU N will share its memory with the 0 bus of PU 0 for
reasons discussed later.) The third bus will be connected to a “local” memory
bank. Each of the three busses of a PU can address up to 2^8 64K-byte banks
of memory.

It  would  appear  that  direct  PU-to-PU  intercommunication  could  occur 
through  the  shared  busses.  This  is  not  possible  because  MuRSS  is  an  SIMD 
architecture  with  no  special  latching  hardware  on  the  shared  busses.  Since  all 
the  PUs  must  either  read  or  write  simultaneously,  data  cannot  be  shipped  from 
PU-to-PU  without  some  form  of  latch  (like  the  shared  memory).  Thus,  PU-to- 
PU  intercommunication  must  be  done  through  the  shared  memory.  (Such 
latches  could  be  added  to  the  design,  but  for  the  applications  investigated  thus 
far,  the  use  of  the  shared  memory  for  communication  appears  to  be  sufficient.) 
Therefore,  the  memory  banks  that  are  “shared”  can  be  used  to  store  common 
data  for  a  PU  and  its  linearly  adjacent  neighbor,  eliminating  the  need  for  a 
more  complex  interconnection  structure  when  performing  window-based 
processing  operations. 

Memory  contention  is  not  a  problem,  as  the  only  way  contention  can  occur 
is  if  two  processors  try  to  access  the  same  shared  memory  banks.  This  cannot 
happen with this SIMD system, since whenever processor I is using its 0 bus,
processor I−1 must also be using its 0 bus (it cannot, for example, be using its 2
bus) (Fig. 3.2.4). For the purposes of this discussion, the memories (either
directly or indirectly) associated with busses 0 and 1 of PU I will be said to be
associated with PU I. In general, memory modules 2I and 2I + 1 will be
associated with PU I, shared memory module 2I with bus 0 and local memory
module 2I + 1 with bus 1.

It  is  possible  that  the  shared  memories  may  be  needed  to  store  local  data, 
e.g.,  when  there  is  too  much  local  data  for  the  local  memories  to  handle.  In 
this  case,  only  the  memory  addressable  by  the  busses  associated  with  each 
processor  (i.e.,  bus  0  and  bus  1)  should  be  used  to  store  local  data.  Thus,  for 
PU I, memory module 2I should store data to be shared with PU I−1 and any
local data that will not fit into memory module 2I + 1. Memory module 2I + 1
should be used to store the majority of local data for PU I. Memory module
2I + 2 should not be used for data local to PU I.

This requirement is not a rigid requirement, i.e., when all 2N + 1 memory
banks are working, PU I could use memory modules 2I, 2I + 1, and 2I + 2 for
local data; however, if even one memory bank fails, algorithms not satisfying
this requirement cannot be executed by MuRSS.
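
The module numbering and the contention argument can be checked with a short sketch. It is illustrative only: the function names are hypothetical, and the modulo wraparound is an assumed reading of the note that bus 2 of PU N shares its memory with bus 0 of PU 0.

```python
def module_on_bus(pu: int, bus: int, n: int) -> int:
    """Memory module reached by bus 0, 1, or 2 of PU `pu`.

    The system has PUs 0..n and modules 0..2n+1. Bus 0 reaches shared
    module 2*pu, bus 1 reaches local module 2*pu + 1, and bus 2 reaches
    the shared module of the next PU (wrapping around, by assumption).
    """
    if bus not in (0, 1, 2):
        raise ValueError("bus must be 0, 1, or 2")
    return (2 * pu + bus) % (2 * n + 2)


def contention_free(n: int) -> bool:
    """In SIMD mode all PUs use the same bus number at the same time.

    Contention is impossible if, for each bus number, every PU reaches
    a different memory module.
    """
    for bus in (0, 1, 2):
        modules = [module_on_bus(pu, bus, n) for pu in range(n + 1)]
        if len(set(modules)) != len(modules):
            return False
    return True


if __name__ == "__main__":
    print(module_on_bus(3, 0, 1024), module_on_bus(2, 2, 1024))
    print(contention_free(1024))
```

Bus 0 of PU 3 and bus 2 of PU 2 both reach module 6, yet they can never be active in the same step, since every PU is driven to the same bus number by the broadcast instruction.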

The organization of the memory is shown in Fig. 3.2.5. This figure
assumes  that  there  are  L  memory  banks  associated  with  each  bus.  The 
memory  associated  with  MuRSS  will  be  dual  ported,  allowing  a  given  memory 
bank  to  be  connected  to  two  busses  simultaneously.  One  bus  will  be  connected 
to  a  MuRSS  PU,  while  the  other  bus  will  be  connected  to  the  host.  This  will 
allow  the  host  to  address  the  memories  separately  from  the  processors,  enabling 
the  host  to  load/unload  data  into/from  half  the  banks,  while  the  processor 
operates on data from the other half, maximizing overlap. This type of overlap
is called double buffering and is similar to the approaches taken with the
CDC FP system in Section 2.3.6 and with the PASM system in Section 2.4.3
[SiS81]. Double buffering can be implemented in hardware, allowing the
memory  to  be  addressed  contiguously,  simplifying  the  loading  and  unloading  of 
data.  If  the  addresses  associated  with  the  memory  banks  (as  viewed  by  the 
host)  are: 


where the Half indicates which half of the double buffer is to be addressed, the
PU number is the number of the associated PU, and the Bus bit is the bus to
be addressed (0 = left, 1 = center). When a fault occurs, the CU can re-program
the PU numbers, so the remaining memory can be treated as contiguous by the
host (this is discussed further below). If all memory banks are attached to a
bus that is accessible by the host, the host can view the memories as
contiguous. Each PU is associated with 2^9 64K-byte memory banks, and many
processors will not be able to directly address this much memory (> 2^32 memory
locations), so the host may need to use some form of memory controller. Some
memory controllers may allow a special microprogram to be installed to
facilitate handling the memory organization.
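
The size of the address space seen by the host follows directly from the bank counts. A minimal sketch, assuming 64K-byte banks, 2^8 banks per bus, and the two busses (0 and 1) associated with each PU; the constant and function names are hypothetical.

```python
BANK_BYTES = 64 * 1024   # one 64K-byte memory bank
BANKS_PER_BUS = 2 ** 8   # banks addressable on each bus
BUSSES_PER_PU = 2        # bus 0 (shared) and bus 1 (local)


def host_visible_bytes(num_pus: int) -> int:
    """Total bytes in the memory banks associated with num_pus PUs."""
    return num_pus * BUSSES_PER_PU * BANKS_PER_BUS * BANK_BYTES


if __name__ == "__main__":
    total = host_visible_bytes(1024)
    print(total, total > 2 ** 32)
```

With 1024 PUs the total is 2^35 bytes, which exceeds a 32-bit address range and so motivates the memory controller mentioned above.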

Consider the procedures that the host must perform to address the pixel
(i,j) in an R row by C column image consisting of b-byte elements. Assume the
data is stored in column major format, i.e.,

If each PU has the same number of columns of data, then pixel (i,j) is in PU P:


This looks very complex, but these calculations must be done only once per
column. Further, if many columns of data are to be loaded/unloaded into/from
the memory units, the following algorithm can be applied:


int P;    /* PU counter */
int B';   /* Base address of array in shared memory */
int B'';  /* Base address of array in local memory */
int C';   /* Columns of data stored in shared memory */
int C'';  /* Columns of data stored in local memory */
int N;    /* PUs in use */

for ( P = 0 ; P < N ; P = P + 1 ) {    /* for each processor */
    /* completely unload bus 0 of Processor P */
    read (b*R*C' bytes from address B' of bus 0 of PU P);

    /* completely unload bus 1 of Processor P */
    read (b*R*C'' bytes from address B'' of bus 1 of PU P);
}

This  type  of  scheme  is  particularly  convenient  if  a  memory  controller  is 
used  and  the  memory  controller  can  perform  Direct  Memory  Access  (DMA)  to 
and  from  the  host’s  memory.  If  DMA  is  used,  the  above  algorithm  for 
unloading  data  from  an  N  =  1024  MuRSS  would  require: 

2048  block  reads 
1024  compares  and 
1024  additions. 

Further, if the entire “half 0” or “half 1” of the memory banks is to be
read/written, only one read/write (of size 2^35 bytes) would be needed. These
transfers  could  occur  between  MuRSS  and  the  host’s  secondary  memory  or  the 


int P;    /* PU counter */
int B';   /* Base address of array in shared memory */
int B'';  /* Base address of array in local memory */
int C';   /* Columns stored in shared memory */
int C'';  /* Columns stored in local memory */
int bR;   /* Bytes of data per column */
int N;    /* PUs in use */
int i;    /* Row counter */
int j;    /* Column counter (byte address of the start of a column) */

for ( i = 0 ; i < R ; i = i + 1 ) {                           /* each row */
    for ( P = 0 ; P < N ; P = P + 1 ) {                       /* each processor */
        for ( j = B' ; j < B' + bR*C' ; j = j + bR ) {        /* shared columns */
            /* unload one data item from bus 0 of PU P */
            read (b bytes from address j + b*i of bus 0 of PU P);
        }
        for ( j = B'' ; j < B'' + bR*C'' ; j = j + bR ) {     /* local columns */
            /* unload one data item from bus 1 of PU P */
            read (b bytes from address j + b*i of bus 1 of PU P);
        }
    }
}


This algorithm represents a significant number of calculations on the part
of the host. With the large number of individual reads, each of which takes time
to create a system buffer, it is less cumbersome for the host to unload the
image in column format and transpose the image in its own memory.
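
The difference between the two unloading orders is easiest to see by counting read requests. The sketch below is a rough illustration under stated assumptions: the function names are hypothetical, one block read per bus is assumed for column order, and one read per data item is assumed for row order.

```python
def column_order_reads(num_pus: int) -> int:
    """Block reads when each PU's two busses are unloaded whole."""
    return 2 * num_pus


def row_order_reads(rows: int, num_pus: int,
                    shared_cols: int, local_cols: int) -> int:
    """Item reads when the image is unloaded one row at a time.

    Every row touches every column held by every PU, and each touch is
    a separate read of one data item.
    """
    return rows * num_pus * (shared_cols + local_cols)


if __name__ == "__main__":
    # Hypothetical sizes: 512 rows, 2 shared and 8 local columns per PU.
    print(column_order_reads(1024))
    print(row_order_reads(512, 1024, 2, 8))
```

With these hypothetical sizes the row-order unload issues more than five million reads against 2048 block reads, which is the imbalance that makes a host-side transpose attractive.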


If  the  image  is  loaded  in  row  major  format,  the  algorithms  are  similar,  but 
rows  and  columns  are  reversed.  Similarly,  for  such  a  scheme,  it  is  simple  for 
the  host  to  deal  with  row  data  and  complex  for  the  host  to  deal  with  column 
data.  Given  that  an  image  is  treated  consistently  (i.e.,  not  transposed  during 
loading  or  unloading),  MuRSS  can  handle  data  in  either  row  major  or  column 
major format without excessive processing. For example, consider the image in
Fig. 2.3.2.1. Here, each PU would hold an entire stripe I/N-by-J pixels large,
effectively processing the image in row major format. The shared data in Fig.
2.3.6.2, as required for classification of non-linear windows, would be stored in
the shared memories.

Consider  an  image  stored  in  column  major  format.  Define  the  relative 
index  of  the  pixel  (i,j)  to  be  the  row  and  column  of  the  pixel  relative  to  the 
uppermost  left  pixel  in  the  PU’s  address  space.  In  an  image  stored  in  column 
major  format,  the  absolute  pixel  (i,j)  would  have  relative  address  (i,j’),  where  j’ 
is  the  number  of  columns  to  the  right  of  the  leftmost  column  addressable  by 
the PU. Thus, if each PU could address ten columns of data, the relative
address (1,0) would correspond to the N pixels whose absolute addresses were
(1, 10 × k), k = 0, 1, 2, ..., N−1. Typically, if C' columns of data were stored in
the shared memory associated with bus 0 of PU I, then C'/2 pixels would be
processed by PU I−1 and C'/2 pixels would be processed by PU I, as was done
for the FP system discussed in Section 2.3.6. This means that PUs will
typically start their processing for the pixels with relative address (0,C'/2).
For pixels with relative address (i,j'), if there are C' columns of data associated
with busses 0 and 2 and C'' columns of data associated with bus 1, the bus can be
determined as follows:


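\[
\text{bus} =
\begin{cases}
0 & 0 \le j' < C' \\
1 & C' \le j' < C' + C'' \\
2 & \text{else}
\end{cases}
\]

The address of the pixel (within the bus) is:

\[
\text{address} =
\begin{cases}
b \times (j' \times R + i) & \text{bus 0} \\
b \times ((j' - C') \times R + i) & \text{bus 1} \\
b \times ((j' - C' - C'') \times R + i) & \text{bus 2}
\end{cases}
\]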
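
The bus selection and address rules above translate directly into a small lookup routine. This is a sketch under the stated column-major layout; the function and parameter names are hypothetical, and the range checks are an addition for safety rather than part of the original rules.

```python
def locate_pixel(i: int, j_rel: int, R: int, b: int,
                 c_shared: int, c_local: int):
    """Map relative pixel (i, j') to (bus, byte address within the bus).

    c_shared is C' (columns on each of busses 0 and 2), c_local is C''
    (columns on bus 1), R is the number of rows, b the bytes per element.
    """
    if not 0 <= i < R:
        raise ValueError("row index out of range")
    if not 0 <= j_rel < 2 * c_shared + c_local:
        raise ValueError("column index out of range")
    if j_rel < c_shared:
        bus, col = 0, j_rel
    elif j_rel < c_shared + c_local:
        bus, col = 1, j_rel - c_shared
    else:
        bus, col = 2, j_rel - c_shared - c_local
    # Column-major storage: whole columns of R elements, b bytes each.
    return bus, b * (col * R + i)


if __name__ == "__main__":
    print(locate_pixel(3, 0, R=100, b=4, c_shared=2, c_local=8))
    print(locate_pixel(1, 3, R=100, b=4, c_shared=2, c_local=8))
```

Stepping down a column changes only `i`, so consecutive pixels in a column are `b` bytes apart, in line with the pointer-increment remark that follows.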


Addressing  within  a  given  column  requires  setting  a  pointer  to  the  base  address 
of  the  column  and  incrementing  or  decrementing  it  by  a  fixed  amount.  If 
(N  >  28)  and 

The  CU  will  be  a  special  purpose  processor.  It  will  be  equipped  with 
memory,  in  which  it  will  store  its  program,  global  data,  the  program  to  be 
broadcast  to  the  PUs,  and  its  local  variables.  The  amount  of  memory  is 
variable  and  is  a  function  of  cost  and  the  processor  chosen  for  the  CU. 

The  host  will  be  assumed  to  be  a  computer  such  as  an  IBM-370  or  a  PDP- 
11  series  machine.  All  support  operations,  such  as  formatting  input  and 
formatting  output,  will  be  performed  by  the  host. 

Each  PU  is  based  on  the  Motorola  68000  microprocessor.  From  [Mot81],  a 
12.5  MHz  68000  can  perform  a  16-bit  integer  addition  in  400  nsec.  The  1024 
68000’s  in  MuRSS  can  perform  2560  million  integer  additions  per  second.  In 
addition,  MuRSS  equipped  with  Motorola's  high  speed  floating  point  software 


can  perform  73  million  32-bit  floating  point  additions  per  second  or  36  million 
32-bit  floating  point  multiplications  per  second.  When  the  PUs  are  equipped 
with  the  planned  16.666  MHz  MC68881  floating  point  processor,  MuRSS  is 
capable  of  367  million  32-bit  floating  point  additions,  330  million  32-bit  floating 
point  multiplications,  or  270  million  32-bit  floating  point  divisions  per  second. 
All  floating  point  operations  are  in  accordance  with  the  IEEE  floating-point 
specification  P754. 
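
The aggregate rates quoted above are the per-PU rates multiplied by the number of PUs. The sketch below checks the integer figure and derives the implied per-PU operation times; the function names are hypothetical and the calculation assumes all 1024 PUs are active in every step.

```python
def aggregate_rate(num_pus: int, op_time_s: float) -> float:
    """Operations per second when every PU completes one op per op_time_s."""
    return num_pus / op_time_s


def implied_op_time(aggregate_ops_per_s: float, num_pus: int) -> float:
    """Per-PU time for one operation implied by an aggregate rate."""
    return num_pus / aggregate_ops_per_s


if __name__ == "__main__":
    print(aggregate_rate(1024, 400e-9))   # 16-bit integer additions
    print(implied_op_time(73e6, 1024))    # software floating point add
    print(implied_op_time(367e6, 1024))   # MC68881 floating point add
```

The 400 nsec addition time yields the quoted 2560 million additions per second, and the floating point figures correspond to roughly 14 μs per software addition and under 3 μs per coprocessor addition on each PU.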


3.3.  Smoothing  on  a  Parallel  SIMD  Machine 


Smoothing  is  a  method  of  noise  reduction  for  image  data.  The 
measurement  vector  for  each  pixel  is  replaced  by  the  average  of  the 
measurement  vector  for  that  pixel  and  the  measurement  vectors  of  the  eight 
surrounding pixels. Consider the following example, as shown in Fig. 2.2.1.1(b).
x_{i,j}, the measurement vector for pixel (i,j), is replaced by:

\[
x'_{i,j} = \frac{1}{9} \left( x_{i-1,j-1} + x_{i,j-1} + x_{i+1,j-1} + x_{i-1,j} + x_{i,j} + x_{i+1,j} + x_{i-1,j+1} + x_{i,j+1} + x_{i+1,j+1} \right)
\]


Thus, for each pixel, eight vector additions and one division of a vector by a
constant are required. Consider the case where each measurement vector is
4-dimensional and the image is I-by-J pixels. Smoothing the image on a serial
machine will require 8*I*J vector additions and I*J vector divisions, translating to
32*I*J additions and 4*I*J divisions.
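
A serial version of the smoothing operation can be sketched as follows. This is an illustration rather than the thesis implementation: the function names are hypothetical, the image is a nested list of measurement vectors, and border pixels are simply copied because the text does not say how they are treated.

```python
def smooth(image):
    """Replace each interior vector by the mean over its 3-by-3 window."""
    rows, cols = len(image), len(image[0])
    out = [[[float(x) for x in v] for v in row] for row in image]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            acc = [float(x) for x in image[i][j]]
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    # One of the eight vector additions per pixel.
                    neighbor = image[i + di][j + dj]
                    acc = [a + n for a, n in zip(acc, neighbor)]
            out[i][j] = [a / 9.0 for a in acc]  # one vector division
    return out


def serial_op_counts(rows: int, cols: int, dims: int = 4):
    """(scalar additions, scalar divisions) using the counts in the text."""
    return 8 * rows * cols * dims, rows * cols * dims


if __name__ == "__main__":
    img = [[[3 * r + c, 2 * (3 * r + c)] for c in range(3)] for r in range(3)]
    print(smooth(img)[1][1])
    print(serial_op_counts(512, 512))
```

The operation counts follow the text in charging every pixel the full eight additions and one division, which slightly overstates the work done by this border-copying sketch.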


If I is sufficiently large (> 2N + 1) and a multiple of N, the image can be
divided into N stripes I/N pixels high as shown in Fig. 2.3.2.1. This scheme is
called striping and has been discussed in Section 2.3.2. Each processor will


that  at  least  two  rows  of  data  will  have  to  be  stored  in  shared  memory.  For 


example,  with  a  512-by-512  image  and  32  processors,  processor  0  will  process 
rows 0 to 15, while processor 1 will process rows 16 to 31, etc. Memory 0 will
store row 0, memory 1 will store rows 1 through 14, memory 2 will store rows
15  and  16,  etc.  Note  that  memories  0  and  2  could  contain  more  rows  of  data. 
In  general,  up  to  two  rows  of  data  must  be  stored  in  each  shared  memory.  The 
rest  of  the  image  can  be  stored  in  the  local  memory  banks.  The  total 
processing  time  associated  with  an  image  is:  32*I*J/N  additions  and  4*I*J/N 
divisions.  Thus,  the  theoretical  maximum  speedup  by  a  factor  of  N  is  achieved. 

If I is not a multiple of N, all processors will process $\lfloor I/N \rfloor$ rows, then I
mod N processors will have to process one extra row of data. For simplicity,
assume that rows cannot be subdivided. Thus, some processors will have to
process a stripe $\lfloor I/N \rfloor$ rows wide, while other processors will have to process a
stripe $\lceil I/N \rceil$ rows wide. If each row is J pixels wide, the total processing time
associated with a given image will be:

32*J*$\lceil I/N \rceil$ additions
4*J*$\lceil I/N \rceil$ divisions

This represents an increase of at most 32*J additions and 4*J divisions over
the ideal case. The efficiency of the above implementation can be represented
by the ratio of the time required for an ideal speedup to the actual processing
time [SiS82b]. This translates to:

$$\frac{I/N}{\lceil I/N \rceil}$$


The worst case efficiency is achieved when one processor is running while the
remaining processors are idled. Mathematically, this is when the difference
between I/N and $\lceil I/N \rceil$ is a maximum. For example, with N=1024 and an
image with 4097 rows, this represents an efficiency of 80%, while for I=65537,
this represents an efficiency of 98.4%. The larger the image, the closer the
efficiency is to 100%.
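The efficiency ratio for simple striping can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function name `simple_striping_efficiency` is hypothetical, and it assumes whole rows are assigned to processors as described above.

```python
def simple_striping_efficiency(rows: int, n_procs: int) -> float:
    """Ratio of the ideal time (rows / N) to the actual time (ceil(rows / N))."""
    stripe_height = -(-rows // n_procs)  # integer ceiling of rows / N
    return (rows / n_procs) / stripe_height


if __name__ == "__main__":
    # The two image heights used in the text, with N = 1024 processors.
    for rows in (4097, 65537):
        eff = simple_striping_efficiency(rows, 1024)
        print(f"I = {rows}: efficiency = {eff:.4f}")
```

When I is an exact multiple of N the function returns 1.0, which corresponds to the ideal speedup of N.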

Note  that  the  efficiency  is  a  function  of  the  number  of  rows.  Processing 
columns  instead  of  rows  will  make  the  efficiency  a  function  of  the  number  of 
columns  and  may  allow  N  processors  to  operate  more  efficiently.  An 
alternative  to  the  above  method  is  to  use  the  “modified  striping”  scheme 
discussed  in  Section  2.4.3. 

The  time  required  to  smooth  an  image  using  modified  striping  is: 

32*$\lceil I*J/N \rceil$ additions

4*$\lceil I*J/N \rceil$ divisions

For the ideal speedup of N, the ceiling function would be absent, thus the ratio
of the ideal speedup to the actual speedup becomes:

$$\frac{I*J/N}{\lceil I*J/N \rceil}$$

For N = 1024, and an image of size 1025-by-4097, the efficiency is 99.99+%.


This method thus leads to a higher overall utilization of the processors.
Further, for images greater than 2N-by-2N, the utilization is independent of
the orientation of the image, i.e., whether the image is striped based on rows or
columns.
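The two striping schemes can be compared side by side. This is an illustrative sketch only: the function names are hypothetical, and it models nothing beyond the two efficiency ratios given in the text (edge handling and memory layout are ignored).

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def simple_efficiency(i_rows: int, j_cols: int, n_procs: int) -> float:
    """Simple striping: whole rows per processor, so J cancels out of the ratio."""
    return (i_rows / n_procs) / ceil_div(i_rows, n_procs)


def modified_efficiency(i_rows: int, j_cols: int, n_procs: int) -> float:
    """Modified striping: pixels, not rows, are divided among the processors."""
    pixels = i_rows * j_cols
    return (pixels / n_procs) / ceil_div(pixels, n_procs)


if __name__ == "__main__":
    n = 1024
    for i_rows, j_cols in ((4097, 1024), (4097, 513), (2049, 2049)):
        print(i_rows, j_cols,
              round(simple_efficiency(i_rows, j_cols, n), 4),
              round(modified_efficiency(i_rows, j_cols, n), 4))
```

Because $\lceil I*J/N \rceil \le J*\lceil I/N \rceil$, the modified scheme is never less efficient than simple striping under this model.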

If  edge  data  is  to  be  handled  differently  than  data  internal  to  the  image, 
when  one  or  more  processors  reach  an  edge,  all  other  processors  must  be 
disabled.  The  remaining  processors  then  process  their  edge  data.  This  is  not 
required  in  the  simple  striping  scheme,  as  all  the  processors  reach  an  edge  at 
the  same  time.  In  a  modified  striping  scheme  (with  horizontal  stripes),  the 
probability  that  a  given  processor  is  processing  an  edge  pixel  is: 


In addition, each PE must decide (for each pixel it processes) whether that
pixel is an edge or non-edge pixel. The modified striping scheme requires
2*$\lceil I*J/N \rceil$ more comparisons and a maximum of $p_{edge}$*$\lceil I*J/N \rceil$ more edge
pixel computations than the simple striping scheme in the ideal case where N
divides I or J. Simple striping requires at most 2 more edge pixel computations
and  1-2  more  internal  pixel  computations  than  simple  striping  in  the  ideal  case. 
The  striping  scheme  to  be  used  should  minimize  the  number  of  computations 
above  the  ideal  case. 

Images smaller than 2N rows have not been considered, as they do not
have enough rows to utilize the full machine. Each processor will have to store
at least one row of data in each of its shared memory banks. This implies that
there are at least two rows of data per processor. Multiplication of the two
row minimum by the N processors yields 2N rows. If striping is done by
columns, then the argument is similar. To process small images (using rows),
$\lceil I/2 \rceil$ processors would have to be enabled, while the rest of the processors were
disabled for the entire task.


3.4.  Maximum  Likelihood  Classification 

Maximum likelihood classification (MLC) [SwD78] classifies each pixel
independently of all others. Assume that the input data can be described by a
Gaussian distribution function [SwD78]. Thus, the probability that pixel (i,j) is
in a given class $\omega_k \in \Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ is:

$$p(X_{ij} \mid \omega_k) = \frac{1}{\sqrt{(2\pi)^{b}\,|\Sigma_k|}} \; e^{-\frac{1}{2}(X_{ij}-m_k)^T \Sigma_k^{-1} (X_{ij}-m_k)}$$

where $X_{ij}$ is the measurement vector for pixel (i,j), b is the number of
spectral bands (the dimension of $X_{ij}$), $m_k$ is the mean vector for
class k, and $\Sigma_k$ is the covariance matrix for class k. A pixel is assigned to a given
class such that $p(X_{ij} \mid \omega_k)$ is maximized. It is possible to use a discriminant
function [SwD78]:

$$d_k(X_{ij}) = -\left[\ln|\Sigma_k| + (X_{ij}-m_k)^T \Sigma_k^{-1} (X_{ij}-m_k)\right]$$

Maximizing this last discriminant function for $X_{ij}$ over $\Omega$ will yield the same
result as maximizing $p(X_{ij} \mid \omega_k)$ over the same $\Omega$. The discriminant function is
considerably less complex to calculate than the probability, so discussion is
based on the discriminant function.


The calculation of $-\ln|\Sigma_k|$ and $\Sigma_k^{-1}$ is done once for each information class
and is negligible when compared to the calculation of the discriminant function
for each class for each pixel in a given image. Again assuming $X_{ij}$ is
4-dimensional, $X_{ij}-m_k$ can be done in four additions per class per pixel. By
utilizing the symmetry of $\Sigma_k^{-1}$, $(X_{ij}-m_k)^T \Sigma_k^{-1} (X_{ij}-m_k)$ can be performed in 20
multiplies and 9 additions for the four spectral band case. Thus, the
calculation of the discriminant function will require 20 multiplies, 15 additions,
and one sign change per pixel per class. Finally, for C class data, C-1
compares per pixel will be needed in addition to the calculation of the
discriminant function. On an I-by-J image, classification of all I*J pixels will
require 20*I*J*C multiplications, 15*I*J*C additions, and I*J*(C-1) compares
for a standard serial processor.
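The per-pixel work described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the thesis code: the names `discriminant` and `classify` are hypothetical, the per-class constants $-\ln|\Sigma_k|$ and $\Sigma_k^{-1}$ are assumed to be precomputed and passed in, and the sketch exploits the symmetry of $\Sigma_k^{-1}$ by visiting only the upper triangle. It does not attempt to reproduce the exact operation counts quoted in the text.

```python
def discriminant(x, mean, inv_cov, neg_log_det):
    """Return -ln|Sigma_k| - (x - m_k)^T Sigma_k^{-1} (x - m_k) for one class."""
    y = [xi - mi for xi, mi in zip(x, mean)]
    n = len(y)
    quad = 0.0
    for i in range(n):
        quad += inv_cov[i][i] * y[i] * y[i]              # diagonal term
        for j in range(i + 1, n):
            quad += 2.0 * inv_cov[i][j] * y[i] * y[j]    # symmetric pair counted once
    return neg_log_det - quad


def classify(x, classes):
    """classes: list of (mean, inv_cov, neg_log_det); returns the best class index."""
    scores = [discriminant(x, m, s, c) for (m, s, c) in classes]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    classes = [([0.0] * 4, identity, 0.0), ([10.0] * 4, identity, 0.0)]
    print(classify([1.0, 1.0, 1.0, 1.0], classes))
```

Since each pixel is scored independently, the same routine could be run by every processor on its own portion of the image without any exchange of data.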

Consider implementing the MLC on MuRSS. The CU will broadcast class
dependent constants, such as $\Sigma_k^{-1}$ and $m_k$, as part of the SIMD program. Each
pixel is classified independently, thus there is no need for any inter-processor
communication. Using the modified striping scheme to divide the I-by-J image,
N PUs will be able to perform an MLC

$$\frac{I*J/N}{\lceil I*J/N \rceil} \times N$$

times faster than a single PU. Further, since this operation requires no
inter-processor data transfers, images as small as N pixels can be processed without
disabling PUs for the entire operation.


3.5.  Contextual  Classification 

The “class” associated with a given pixel is not independent of the classes
of adjacent pixels. Stated in terms of a statistical classification framework,
there may be a better chance of correctly classifying a given pixel if, in
addition to the spectral measurements associated with the pixel itself, the
measurements and/or classifications of its “neighbors” are considered as well.
The image can be considered to be a two-dimensional random process
incorporated into the classification strategy. This is the objective of
“contextual classifiers” ([WeS71] and [SwV81]), in which a form of compound
decision theory is employed through the use of a statistical characterization of
context. Recent investigations have demonstrated the effectiveness of a
contextual classifier that combines spatial and spectral information by
exploiting the tendency of certain ground-cover classes to occur more
frequently in some spatial contexts than in others [SwS80], [WeS71], [SwV81],
and [TiS81]. For a more complete description of contextual classifiers, please
refer to Section 2.2.1.

The  application  of  MuRSS  to  contextual  classification  is  a  straightforward 
extension  of  the  method  applied  in  Sections  2.3.2  and  2.3.3.  For  the  three-by- 
three  window,  data  allocation  and  timing  analysis  is  analogous  to  that  for 
smoothing.  The  main  difference  is  that  for  smoothing,  only  the  raw  pixel  data 
is shared. For the contextual classifier, the “compf” values of the subimage
edge  pixels  are  shared  instead.  The  parallel  processor  version  of  the  one-by- 
three  horizontally  linear  window  is  similar.  Other  sizes  and  shapes  of  windows 
can  be  handled  analogously. 


3.6.  Image  Correlation  on  a  Parallel  Machine 

Image  correlation,  as  described  in  [SiS82a],  is  used  to  measure  the  degree 
of  similarity  between  a  match  image  and  an  equal  sized  area  of  an  input  image. 
Typical  images  can  be  at  least  4096-by-4096  pixels,  with  match  areas  on  the 
order  of  64-by-64  pixels.  For  the  purposes  of  this  paper,  images  on  the  order  of 
65536-by-65536  pixels  will  be  considered. 

Let  the  symbols  x  and  y  denote  single  elements  of  arrays  X  and  Y,  where 
X  is  the  match  image  and  Y  is  the  area  of  the  input  image  under  consideration 
(same  dimensions  as  X).  Let  M  be  the  total  number  of  elements  in  the  match 
area  X.  Define: 

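$$S_{XX} = (1/M)\left(\sum x^2 - (\sum x)^2\right)$$

$$S_{XY} = (1/M)\left(\sum xy - \sum x \sum y\right)$$

$$S_{YY} = (1/M)\left(\sum y^2 - (\sum y)^2\right)$$

$$R_{XY} = S_{XY} / \sqrt{S_{XX} S_{YY}}$$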


$S_{XY}$ is the covariance of the match area with a portion of the input area. Large
positive values for $S_{XY}$ indicate similarity between the match image and the
input image, while large negative values for $S_{XY}$ indicate similarity between the
negative of the match image and the input image. Values near zero indicate
little similarity between the two images. $R_{XY}$ is the linear correlation
coefficient of the statistics. Simplistically, $R_{XY}$ is a normalized version of $S_{XY}$ in
which $R_{XY} = 1$ indicates an identical match, $R_{XY} = -1$ indicates an identical
match with the negative of the input area, and $R_{XY} = 0$ indicates no
correlation between the match area and the input image. A correlation value
will be computed for each position in which the match image can fit into the R
row by C column input image.

The calculation of $R_{XY}$ is dominated by the time to compute $\sum xy$, $\sum y$,
and $\sum y^2$. $\sum x$ and $\sum x^2$ do not change from input window to input window,
and can thus be pre-computed. For a match template with r rows and c
columns, each $\sum xy$ and $\sum y^2$ requires r*c multiplications and r*c-1 additions.
$\sum y$ requires r*c-1 additions. These operations have to be done for each
position of the match template in the input image. Special methods of
computing $\sum y^2$ and $\sum y$ can decrease the time requirements of this algorithm.
Consider the following algorithm for computing the sum of the pixel values
(the $\sum y$'s) in each match template.

Assume that for input image Y the position of the match area is defined
by the coordinates of the upper left hand corner of the match area. Define a
vector “colsum” [SiS82a] of length C as:

$$\mathrm{colsum}(j) = \sum_{i=k}^{k+r-1} Y(i,j)$$

where k is the row coordinate of the current position of the match area and
$0 \le j < C$. Let “SUM” be an (R-r+1)-by-(C-c+1) array, where SUM(i,j) is the sum
of the pixels of the input image for the match area position
(i,j), $0 \le i < R-r+1$, $0 \le j < C-c+1$.

Initially, colsum is calculated for all C columns of row 0. SUM(0,0) is
formed by summing colsum(j) ($0 \le j \le c-1$). This requires r*c multiplications
and (r*c)-1 additions. SUM(0,1) is formed by subtracting colsum(0) from
SUM(0,0) and adding colsum(c) to the result. In general:

SUM(0,j) = SUM(0,j-1) - colsum(j-1) + colsum(j+c-1)


After the processing of a given row is complete, colsum(j) is updated for the
next row by subtracting Y(i,j) from the old colsum(j) and adding Y(i+r-1,j) to
the result. This changes the complexity for the calculation of the $\sum y$'s to: 3c-1
additions/subtractions per template position for the column 0 entries of all
other rows, and 4 additions/subtractions per template position for all other
template positions.

For a typical 64-by-64 match image, straightforward computation of $\sum y$
requires 4095 additions per match template position on the input image. This is
the same number of operations required per match template position in row 0
of the input image. For template positions in column 0 of the other rows, 191
additions are required. Computation of the $\sum y^2$'s is similar to the computation of
the $\sum y$'s.
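The colsum technique described above can be sketched in a few lines. This is a serial illustration under stated assumptions: the function name `window_sums` is hypothetical, the image is a list of rows, and for brevity every colsum entry is advanced when the template moves down one row (the text instead updates entries as they are needed, which yields the same sums).

```python
def window_sums(image, r, c):
    """Return SUM, where SUM[k][j] is the sum of the r-by-c window at (k, j)."""
    n_rows, n_cols = len(image), len(image[0])
    # colsum[j] holds the sum of column j over rows k .. k+r-1 (initially k = 0).
    colsum = [sum(image[i][j] for i in range(r)) for j in range(n_cols)]
    sums = [[0] * (n_cols - c + 1) for _ in range(n_rows - r + 1)]
    for k in range(n_rows - r + 1):
        if k > 0:
            # Slide every column sum down one row: drop row k-1, add row k+r-1.
            for j in range(n_cols):
                colsum[j] += image[k + r - 1][j] - image[k - 1][j]
        sums[k][0] = sum(colsum[:c])
        for j in range(1, n_cols - c + 1):
            # Slide the window right: drop the leftmost column, add the new one.
            sums[k][j] = sums[k][j - 1] - colsum[j - 1] + colsum[j + c - 1]
    return sums


if __name__ == "__main__":
    demo = [[1, 2, 3, 4, 5],
            [6, 7, 8, 9, 10],
            [11, 12, 13, 14, 15],
            [16, 17, 18, 19, 20]]
    for row in window_sums(demo, 2, 3):
        print(row)
```

Applying the same routine to an image of squared pixel values gives the $\sum y^2$ terms.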

Consider  the  application  of  MuRSS  to  this  task.  Each  PU  will  apply  the 
serial  algorithm  to  its  assigned  pixels.  Pixels  will  be  assigned  to  PUs  based  on 
the  vertical  striping  scheme.  If  a  column  of  pixels  lies  in  memory  associated 
with  bus  0  or  bus  1  of  PU  I,  then  PU  I  is  responsible  for  the  computation  of 
the  colsum  and  the  analogous  y2  entries  associated  with  that  column.  If  the 
pixel  in  the  upper  left  hand  corner  of  a  window  lies  in  memory  associated  with 
bus  0  or  bus  1  of  PU  I,  then  PU  I  is  responsible  for  the  computation  of  that 
window. When PU I is performing computations on its rightmost c-1 columns,
it uses the colsum values stored in its bus 2 memory by the previous
computations of PU I+1 (recall that PU I+1's bus 0 memory is PU I's bus 2
memory). Thus, at least c-1 colsum values and the corresponding y values


For an R-by-C image and N PUs, a simple vertical striping scheme will
assign each PU a subimage either R-by-$\lfloor C/N \rfloor$ or R-by-$\lceil C/N \rceil$. Thus, the
total time required for the calculation of the $\sum xy$'s is
(R-r+1)*($\lceil C/N \rceil$-c+1)*((r*c)-1) additions, and (R-r+1)*($\lceil C/N \rceil$-c+1)*r*c
multiplications. The total time associated with the calculation of the $\sum y$'s is
[(R-r)*((3*c)-1)] + [($\lceil C/N \rceil$-c)*((r*c)-1)] + [(R-r)*($\lceil C/N \rceil$-c)*4] additions.
The time required to calculate the $\sum y^2$'s is similar to the time associated with
the calculation of the $\sum y$'s. Extension to the modified striping scheme is
similar to the smoothing case.

If  C  <  N*(c-1),  then  c-1  columns  of  data  cannot  be  associated  with  each 
bus  0,  thus  the  PUs  cannot  all  be  enabled.  If  R  >  N*(r-1),  the  stripes  can  be 
horizontal  instead  of  vertical.  In  this  case,  r  and  c  are  swapped,  as  well  as  R 
and  C. 


3.7.  The  Fault  Tolerance  of  MuRSS 

The  throughput  of  a  MuRSS  is  limited  to  the  largest  number  of  adjacent 
working  (usable)  PUs.  Consider  a  simple  example  with  an  N=8  MuRSS  system 
(a  PU  fault  is  represented  by  BOLD  print  in  a  box). 


Physical:  0  1  2  3  4  5  6  7  8


A single failure leaves eight usable PUs. (If there were no wrap-around
connection, the number of usable PUs would be seven.) The CU can alter the
PU numbers and subsequently the numbers associated with the memory
modules. Thus, for the above fault the CU would renumber the PUs to (an *
indicates an unused PU):


Physical:  0  1  2  3  4  5  6  7  8

Logical:   1  2  3  4  5  6  7  *  0


Fault  detection  procedures  are  beyond  the  scope  of  this  work.  In  both  this 
section  and  in  Section  3.8.,  the  concern  is  with  fault  recovery  once  the 
existence  and  location  of  a  fault  is  known. 

When a MuRSS processor fails, the renumbered MuRSS PUs start with logical
PU 0 to the right of the failed processor. The numbers continue incrementing,
through the wrap-around connection, ending up with logical PU N-1 on the
left of the failed processor. When a local memory module fails, e.g., 2I+1, it is
treated like a fault with PU I. A fault in a shared memory, e.g., 2I, is treated
the same way.
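The renumbering rule for a single fault can be sketched as below. The function name `renumber_single_fault` is hypothetical, `None` stands for the unused PU (shown as * in the text), and the example takes physical PU 7 as the failed unit in a nine-PU system, which is the case the logical numbering in the example above implies.

```python
def renumber_single_fault(num_pus, failed):
    """Map physical PU index -> logical PU number after one PU fails.

    Logical PU 0 is the PU just to the right of the failed one; numbering
    continues through the wrap-around connection and ends just to its left.
    """
    logical = [None] * num_pus
    for offset in range(1, num_pus):
        logical[(failed + offset) % num_pus] = offset - 1
    return logical


if __name__ == "__main__":
    print(renumber_single_fault(9, 7))
```

A failed memory module 2I or 2I+1 would be handled by calling the same routine with `failed` set to I.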

It is possible for the faulty processor or memory module to fail in such a
way that adjacent PUs cannot access the busses shared with the faulty
unit. In such a case, not only the faulty PU but the PU associated with the
inaccessible shared memory module would be unusable because of the inability
to access shared memory. Thus, this would be handled as if two adjacent PUs
failed. This is a special case of the multiple failure situation discussed later.
A  multiple  failure,  such  as: 


Physical: 


reduces  the  number  of  usable  PUs  to  five.  (PUs  5  and  6  cannot  share  data 
with  adjacent  PUs,  and  subsequently  could  not  be  used  for  any  algorithm 
requiring  data  to  be  shared  among  PUs.)  If  either  PU  5  or  PU  6  or  both  were 
also  faulty,  the  same  number  of  usable  PUs  would  exist,  as  demonstrated 
below: 


In  such  an  event,  the  PUs  would  be  renumbered  to: 


Again,  an  *  indicates  an  idled  PU. 

A fault in a shared memory, e.g., 2I, is treated the same way. Multiple
memory faults associated with the same PU I only idle PU I. Multiple
memory faults associated with different PUs idle their associated PUs and
subsequently are treated like multiple PU faults.

It was previously stated that if any local data for PU I is to be stored in a
shared memory module, it should be in memory module 2I. This is
required if an algorithm is to be run on a system with a single fault in one of
the shared memory modules. If this rule is not followed, a fault in a shared
memory bank would require the two PUs attached to a faulty shared memory
module to be disabled instead of one, decreasing the throughput of the system.

The minimum number of usable PUs in an N PU MuRSS with F PU
faults (or disabled PUs) can be expressed by the equation:

$$\text{Usable PUs (min)} = \begin{cases} N & F = 0 \\ \lfloor N/F \rfloor & 1 \le F < N+1 \end{cases}$$

This minimum occurs when faulty PUs are evenly distributed throughout the
system. A few faults can seriously cripple MuRSS, as is shown in Fig. 3.7.1. It
is worthy of note that this is a worst case possibility. If the failures are close
together, the number of usable PUs will be greatly increased. For example, if
the faulty PUs are adjacent, the number of usable PUs is N-F+1.
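The dependence of the usable PU count on where the faults fall can be explored with a small sketch. It assumes, as in the N=8 examples above, N+1 physical PUs numbered 0 to N on a ring, of which at most N are used; the function name `usable_pus` is hypothetical.

```python
def usable_pus(n, faulty):
    """Largest run of adjacent working PUs on the ring, capped at N.

    n      -- N, the number of PUs an algorithm can use (physical PUs are 0..N)
    faulty -- iterable of physical PU numbers that have failed or are disabled
    """
    total = n + 1
    faulty = set(faulty)
    best = run = 0
    # Walk the ring twice so that runs crossing the wrap-around are counted.
    for i in range(2 * total):
        if i % total in faulty:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return min(best, n)


if __name__ == "__main__":
    for faults in ([], [7], [4, 7], [0, 4], [3, 4]):
        print(faults, usable_pus(8, faults))
```

Faults spread evenly around the ring give the smallest result, while adjacent faults leave one long run of N-F+1 working PUs.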

3.8.  An  Enhanced  MuRSS 

To  minimize  the  degradation  of  MuRSS  in  a  multiple  fault  environment, 
consider  the  modifications  shown  in  Fig.  3.8.1.  The  wrap-around  connection 
between  PU  N  and  PU  0  is  the  same  as  before  (see  Fig.  3.2.2).  In  this  figure 
describing  the  Enhanced  MuRSS  (EMuRSS),  there  is  a  bypass  box 
associated  with  each  PU’s  shared  busses.  The  operation  of  the  bypass  boxes  is 
controlled by the CU.

In addition to the bypass boxes, there is deselection circuitry, such as the
SN74S214 [Uni78], between each shared memory module and its corresponding
bus. This circuitry will be used to isolate faults in the shared memory modules so
that the shared busses are still usable. It is assumed that there is some form of
isolation hardware, such as the SN74S244 [Uni78], between each of the memory
modules (both local and shared) and the host bus to prevent a memory module
from failing in such a way as to make the host to memory module bus
unusable. The deselection and isolation hardware is controlled by the CU.

The  effect  of  the  bypass  boxes  is  to  allow  the  system  to  reconfigure 
“around” a faulty unit. Consider an N=8 EMuRSS system where PU 7 is


In  the  single  fault  case,  with  the  use  of  bypass  boxes  there  are  still  eight  usable 
PUs. When a double fault occurs, such as any of those shown in Fig. 3.8.2,
the  number  of  usable  PUs  is  seven,  because  the  use  of  bypass  boxes  allows  the 
connectivity  to  be  maintained.  In  a  normal  MuRSS,  the  number  of  usable  PUs 
would  be  6,  5,  4,  4,  5,  6,  7,  and  7  respectively.  Multiple  (more  than  two)  faults 
are  handled  similarly. 

The  two  modes  and  corresponding  effects  of  bypass  boxes  are  shown  in 
Fig. 3.8.3. These modes allow MuRSS to completely bypass a faulty PU. When
there is a failure in a PU I, the PU is bypassed and its associated shared
memory is deselected. It is assumed that the bypass box/deselection circuitry
can  isolate  any  faulty  hardware  from  the  shared  busses,  allowing  normal 
communications  to  take  place  between  the  two  processors  adjacent  to  the 
faulty  PU. 

The CU can re-assign the PU numbers, allowing the PUs and their
associated memories to be treated as if they were contiguous. As before,
the PUs have a physical number and a logical number. The logical PU
number will not only simplify the addressing by the host, but will, when
combined with the “wrap-around” connection, allow the system to handle one
complete shared bus or bypass box failure with no degradation.

Fig. 3.8.3 Two modes of a bypass box: normal mode (PU not bypassed) and bypass mode (PU bypassed)

If a single PU fails, the bypass boxes associated with its shared busses are
set to bypass mode. The shared memory associated with its bus 0 is
deselected. Disabling the faulty PU has the effect of disabling its local
memory, thus contention on the host bus is not a problem. The logical PU
number of all PUs whose physical PU numbers are greater than the faulty PU
is decremented by one, as is shown in the following example:


Physical:  0  ...  8

Logical:   0  ...  6  7


Physical PU N (previously disabled) becomes logical PU N-1. When a memory
module  (either  local  or  shared)  fails,  it  is  handled  exactly  like  a  fault  with  the 
associated  PU.  The  wrap-around  connection  is  not  used  when  there  is  a  fault 
with  a  single  PU  or  memory  module.  Multiple  faulty  PUs  are  handled 
similarly,  only  in  the  multiple  fault  case,  the  performance  is  degraded  as  there 
are  no  more  working  PUs  to  replace  the  faulty  units.  Multiple  faulty  shared 
and  local  memory  modules  are  handled  like  multiple  faulty  PUs. 
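The EMuRSS renumbering after PU or memory faults can be sketched as follows. The function name `emurss_logical_numbers` is hypothetical, `None` marks a bypassed PU, and the sketch models only the renumbering step: it assumes every listed faulty PU can be bypassed and ignores bypass box and bus faults.

```python
def emurss_logical_numbers(num_pus, faulty):
    """Map physical PU index -> logical PU number when faulty PUs are bypassed.

    Working PUs keep their physical order; each PU to the right of a bypassed
    PU has its logical number reduced by one per bypassed PU before it.
    """
    faulty = set(faulty)
    logical = []
    next_number = 0
    for physical in range(num_pus):
        if physical in faulty:
            logical.append(None)          # bypassed, no logical number
        else:
            logical.append(next_number)
            next_number += 1
    return logical


if __name__ == "__main__":
    print(emurss_logical_numbers(9, [7]))
    print(emurss_logical_numbers(9, [2, 7]))
```

With one fault among N+1 physical PUs, the logical numbers still run from 0 to N-1, which is why a single fault causes no loss of usable PUs.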

A  single  faulty  bypass  box  is  handled  using  the  wrap-around  connection. 
If  there  is  a  fault  with  one  of  the  bypass  boxes  associated  with  PU  I,  PU  I  is 
disabled. PUs with physical numbers I+1 to N are given logical numbers 0 to
N-I-1 and PUs with physical numbers 0 to I-1 are given logical numbers N-I
to N-1. Using the wrap-around connection places the faulty bypass box on the
logical  end  of  the  array,  where  it  and  its  associated  PU  (PU  I)  are  unused.  If,  in 
addition  to  a  single  faulty  bypass  box,  there  are  any  faulty  PUs  or  memory 
modules,  these  additional  faults  can  be  handled  as  described  in  the  last 
paragraph. 
In general, multiple faulty bypass boxes break the connectivity of
EMuRSS. It is assumed that a bypass box failure does not pull down a shared
bus. If it does, it is treated the same as a shared bus failure. Multiple faulty
bypass boxes have the same result as multiple PU failures in MuRSS. Thus,
the number of usable PUs is less than N. The set of adjacent usable PUs may
or may not use the wrap-around connection.

If the multiple faulty bypass boxes share the same bus, EMuRSS can
handle two faults with no degradation. This is shown in Fig. 3.8.4. This is the
same situation as for a single faulty shared bus, i.e., the bus shared by PUs I-1
and I in Fig. 3.8.5. If the two faulty bypass boxes are connected to the same
PU, i.e., bus 0 and bus 2 of PU I, EMuRSS can handle two faults with no
degradation. This is shown in Fig. 3.8.6. If the faults are on contiguous
busses, e.g., PU I's bus 0, PU I's bus 2, and PU I-1's bus 2, up to three faults
can be tolerated with no degradation in performance. This is shown in Fig.
3.8.7.

Multiple  faulty  busses  break  the  connectivity  of  EMuRSS.  This  situation 
is  the  same  as  the  case  for  multiple  faulty  bypass  boxes  previously  discussed. 

Since mechanical connections, such as those between a chip and a bus, are
significantly more prone to failure than those within a chip, the number of
mechanical connections can give a fair indication of the probability of failure of
a unit. In MuRSS, both shared and local memory busses are connected to $2^8$
64-Kbyte chips and each chip has 28 pins, so (including the 64 pins on the
68000) there are a minimum of 14400 mechanical chip connections that can
cause a fault within each PU and its associated memories (only those busses
associated with a PU are considered). This figure is clearly conservative
because a failure in any support hardware (e.g., the CU) will also cause a fault.
For simplicity, only chip connections (pins), as opposed to chip connections and
bus connections (e.g., connections from busses to boards), will be used for this
discussion. MuRSS (N=1024) has 14,745,600 mechanical chip connections.

The  fault  bypass  circuitry  in  EMuRSS  consists  of  the  bypass  box,  the  bus 
performing  the  bypass,  and  shared  memory  unit  deselection  hardware.  Thus, 
there  are  chip  connections  to  the  CU,  PU,  shared  bus,  shared  memory, 
bypass  bus,  and  deselection  circuitry  that  can  fail.  The  connections  in 
bold  print  are  to  busses  with  26  connections  for  address,  8  connections  for  data, 
4  connections  for  signals,  and  2  connections  for  power  and  ground.  This 
comprises  200  connections.  The  CU  must  have  one  line  to  control  each  bypass 
box  and  one  line  to  control  the  memory  deselection  circuitry,  making  202 
mechanical  chip  connections  that  can  cause  a  fault.  The  processor/memory 
hardware  is  88  times  more  likely  to  fail  due  to  a  mechanical  connection  than 
the bypass circuitry. EMuRSS has 14,952,448 mechanical connections. This
represents  an  increase  in  hardware  complexity  of  1.4  percent  over  the  non-fault 
tolerant  MuRSS,  which  is  a  trivial  change  in  the  complexity  of  the  system 
when  it  is  compared  to  the  additional  fault  handling  capability  of  the  system. 
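The connection counts in the last two paragraphs can be reproduced with a few lines of arithmetic. This is a sketch under two labeled assumptions that are not stated explicitly in the text: the memory chip count per bus is read as 2^8 = 256 (which reproduces the 14400 figure), and the 200 bypass connections are read as five bus attachments of 40 pins each. All constant names are hypothetical.

```python
N_PUS = 1024

# Processor/memory hardware per PU.
CHIPS_PER_BUS = 2 ** 8          # assumption: 256 memory chips on each memory bus
PINS_PER_MEMORY_CHIP = 28
MEMORY_BUSSES_PER_PU = 2        # one local and one shared memory bus
PROCESSOR_PINS = 64             # the 68000

per_pu = MEMORY_BUSSES_PER_PU * CHIPS_PER_BUS * PINS_PER_MEMORY_CHIP + PROCESSOR_PINS
murss_total = per_pu * N_PUS

# Bypass circuitry per PU.
PINS_PER_BUS_ATTACHMENT = 26 + 8 + 4 + 2   # address, data, signals, power/ground
BUS_ATTACHMENTS = 5                        # assumption: 200 / 40 attachments
CU_CONTROL_LINES = 2                       # bypass box line + deselection line

bypass_per_pu = BUS_ATTACHMENTS * PINS_PER_BUS_ATTACHMENT + CU_CONTROL_LINES
bypass_total = bypass_per_pu * N_PUS
increase_percent = 100 * bypass_total / murss_total

if __name__ == "__main__":
    print("connections per PU:", per_pu)
    print("MuRSS total:", murss_total)
    print("bypass connections per PU:", bypass_per_pu)
    print("increase (%):", round(increase_percent, 1))
```

Changing `N_PUS` shows that the relative overhead of the bypass circuitry does not depend on the size of the machine.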

The 1.4 percent figure does not accurately represent the fault tolerance of
the EMuRSS. In MuRSS, a fault in any two of the 14,680,064 connections will yield up to
half the system unusable. Thus, these connections can be labeled as critical to
the system's operation. EMuRSS has 204,800 critical connections. This represents a
significant decrease in the number of critical connections.

To compare the performance of MuRSS to EMuRSS, consider the
following example. If UPWBB is the number of Usable PUs in an N+1 PU
MuRSS With Bypass Boxes, UPWBB = N - F + 1, where F is the number of
faults in the system ($1 \le F < N+1$). (If F = 0, UPWBB = N.)

For F < N+1 faults, the number of Usable PUs in a system with No
Bypass Boxes (UPNBB) would be no less than $\lfloor N/F \rfloor$. The benefit of the
bypass units is demonstrated in Fig. 3.8.8, where UPWBB/UPNBB is graphed
with respect to F. The “sawtooth” nature of this graph stems from the floor
function in the definition of UPNBB. At no time is UPWBB less than UPNBB,
but for an N = 1024 PU system, UPWBB can be up to 512 times greater than
UPNBB.

Thus  for  a  small  increase  in  hardware  complexity,  the  degradation  in  the 
system  performance  due  to  multiple  faults  can  be  significantly  reduced  (by  up 
to  a  factor  of  512  on  a  1024  processor  system). 
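The factor of 512 can be checked by evaluating the two expressions for every fault count. The sketch below uses the definitions given above, with UPNBB taken at its worst case value of floor(N/F); the function names `upwbb` and `upnbb` are hypothetical.

```python
def upwbb(n, f):
    """Usable PUs with bypass boxes: N - F + 1 for F >= 1, N for F = 0."""
    return n if f == 0 else n - f + 1


def upnbb(n, f):
    """Worst case usable PUs with no bypass boxes: floor(N / F), N for F = 0."""
    return n if f == 0 else n // f


def worst_case_benefit(n):
    """Return (F, ratio) maximizing UPWBB / UPNBB over 1 <= F < N + 1."""
    ratios = {f: upwbb(n, f) / upnbb(n, f) for f in range(1, n + 1)}
    best_f = max(ratios, key=ratios.get)
    return best_f, ratios[best_f]


if __name__ == "__main__":
    print(worst_case_benefit(1024))
```

The ratio jumps each time floor(N/F) drops by one, which produces the sawtooth shape mentioned in the text.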

3.9.  MPP  --  A  Massively  Parallel  Processor

For  the  basis  of  comparison,  consider  the  Massively  Parallel  Processor 
(MPP)  as  described  in  [Bat82]  and  [Bat80].  MPP  is  an  SIMD  machine  which 
was  designed  to  work  efficiently  on  a  variety  of  image  processing  tasks,  such  as 
correlation  and  multispectral  classification.  Fig.  3.9.1  is  a  block  diagram  of 
MPP,  which  illustrates  the  four  major  sub-units.  The  ARray  Unit  (ARU)  is 
the  unit  that  actually  contains  the  Processing  Elements  (PEs),  and  is  capable 
of  processing  arrays  of  data  at  high  speed.  Each  of  the  PEs  in  the  ARU 
performs  instructions  broadcast  by  the  Array  Control  Unit  (ACU)  on  data 
that  are  stored  in  local  memory. 

Logically,  the  ARU  consists  of  a  128-by-128  array  of  PEs.  Physically,  the 
ARU  contains  an  extra  128-by-4  array  of  PEs  for  fault  tolerance.  The  size  of 
the extra 128-by-4 array was determined by packaging constraints. The
bit-serial nature of the PEs allows MPP to perform efficiently on operands of all
lengths. The 16,384 PEs execute instructions on 16,384 bits at a time, which
allows for a very high processing speed.

Each PE in the 128-by-128 array communicates with its four nearest
neighbors in a fashion similar to ILLIAC IV ([Bar68] and [Bou72]). A topology
register in the ACU allows the user to software select what happens to edge
data in the ARU. Top-bottom connections in the ARU are handled
independently from the left-right connections, allowing the user greater
flexibility. There are four possible connections that a PE, call it $PE_i$, on the
right edge of the array can make in addition to connecting to $PE_{i \ominus 1}$,
$PE_{i \ominus 128}$, and $PE_{i \oplus 128}$, where $\oplus$ and $\ominus$ are modulo 16384 addition
and subtraction respectively. They are:

1)  open (no connection)

2)  connect to $PE_{i+1}$ ($PE_{16383}$ has no connection)

3)  connect to left edge PE of same row

4)  same as 2), but connect $PE_{16383}$ to $PE_0$

The connections for left edge PEs correspond to these connections. Top-bottom
connections are less complex than the left-right connections, in that the top
and bottom PEs of a given column may be either connected or left open.
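
The four right-edge options can be sketched as a small lookup. This is an illustrative sketch only: the mode names and the function `right_neighbor` are hypothetical labels rather than MPP terminology, and PEs are assumed to be numbered in row-major order from 0 to 16383, so that a right-edge PE has an index i with i mod 128 = 127.

```python
from typing import Optional

ROW = 128            # PEs per row of the logical array
N_PES = ROW * ROW    # 16,384 PEs in the logical array


def right_neighbor(i: int, mode: str) -> Optional[int]:
    """Return the index of the PE to the right of PE i, or None if open.

    The mode names are hypothetical labels for options 1) to 4) in the text.
    """
    if i % ROW != ROW - 1:          # interior PE: always connects to PE i+1
        return i + 1
    if mode == "open":              # 1) no connection
        return None
    if mode == "spiral_open":       # 2) PE i+1, with PE 16383 left open
        return i + 1 if i != N_PES - 1 else None
    if mode == "row_wrap":          # 3) left edge PE of the same row
        return i - (ROW - 1)
    if mode == "spiral_closed":     # 4) as 2), but PE 16383 connects to PE 0
        return (i + 1) % N_PES
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, option 3) turns each row into a ring, while options 2) and 4) chain the rows into one long open or closed line of PEs.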

Each PE in the ARU contains a full adder, a shift register, six 1-bit
registers, a programmable length shift register, 1K bits of RAM, a data bus,
combinatorial logic, and a mask register. Fig. 3.9.2 shows the layout of the PE.
The basic cycle time for the PE is 100 ns; however, routing operations are
masked independently of arithmetic operations, so masked routing operations
may be combined with unmasked arithmetic operations. PEs perform the
instruction generated by the ACU on the data stored in their local array. Fig.
3.9.3 is a block diagram of the ACU. The three units comprising the ACU are
the I/O control unit (which manages the flow of data), the PE control unit
(which performs array arithmetic for the applications program), and the main
control unit (which performs scalar arithmetic for the applications program).
Operations of each of the units are overlapped to minimize execution time.

Fig. 3.9.2 MPP PE architecture ([Bat80], [Bat82])

The Program and Data Management Unit (PDMU) controls the overall
flow of programs and data in the system (Fig. 3.9.1), and consists of a
DEC PDP-11. The staging memory is used for format conversion between the
incoming data and the data to be processed. Once the data has been processed,
it is returned to the staging memory, where additional formatting can be
performed.

Through its massive parallelism, high clock rate, and its functional
overlap, MPP is capable of performing 400 million 32-bit floating point
additions per second, 200 million 32-bit floating point multiplications per
second, or 3277 million 16-bit integer additions per second.

It is difficult to compare the cost of EMuRSS and MPP, since MPP is
constructed of specially designed VLSI chips and EMuRSS would not be. The
complexity of this comparison is compounded by the fact that hardware costs
change so rapidly. Therefore, the comparison will be limited to the areas of
processing speed, fault tolerance, and capabilities of MPP and a
1024-processor EMuRSS, both of which process 16,384 bits at a time. The purpose
of this comparison is to highlight the differences in the two architectural
approaches.


EMuRSS  can  perform  3072  million  16-bit  integer  additions  per  second  and 
MPP  can  perform  3277  million  16-bit  integer  additions  per  second.  MPP  is  6 
percent  faster  than  EMuRSS.  EMuRSS  can  perform  307.2  million  16-bit 
integer  multiplications  (yielding  a  32-bit  result)  per  second.  MPP  can  perform 
1861  million  8-bit  integer  multiplications  (yielding  a  16-bit  result)  per  second 
and 902 million 12-bit multiplications per second (yielding a 24-bit result).
32-bit data was not available.
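
The integer addition comparison is simple arithmetic on the two quoted peak rates, sketched below. The processor count of 1024 is an assumption (1024 processors of 16 bits each account for the 16,384 bits processed at a time), and the function name `percent_faster` is hypothetical. Using the quoted rates directly, the advantage works out to a little under 7 percent.

```python
# Peak 16-bit integer addition rates quoted in the text (millions per second).
EMURSS_ADDS = 3072.0
MPP_ADDS = 3277.0
EMURSS_PROCESSORS = 1024  # assumed processor count for this comparison


def percent_faster(fast: float, slow: float) -> float:
    """Relative speed advantage of `fast` over `slow`, in percent."""
    return 100.0 * (fast - slow) / slow


# Advantage of MPP over EMuRSS on 16-bit integer addition.
mpp_advantage = percent_faster(MPP_ADDS, EMURSS_ADDS)

# Implied rate per EMuRSS processor, in millions of additions per second.
adds_per_processor = EMURSS_ADDS / EMURSS_PROCESSORS
```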

In  terms  of  floating  point  operations  per  second,  MPP  outperforms  the 
EMuRSS  without  the  floating  point  processor  (both  MuRSS  and  EMuRSS  have 
the  same  processing  speeds).  The  cycle  time  for  MPP  is  100  nsec,  while  the 
cycle  time  for  EMuRSS  is  less  than  80  nsec,  (see  Section  3.2),  so  it  would  be 
intuitively  pleasing  if  EMuRSS  outperformed  MPP.  Both  processors  operate 
on 16,384 bits of information at a time; however, in all cases MPP will require
the  minimum  number  of  cycles  for  a  given  operation  for  a  given  number  of  bits 
because  of  its  bit  serial  nature.  For  example,  a  typical  32-bit  floating  point 
format  consists  of: 

a  sign  bit  for  the  mantissa, 

an  8-bit  2's  complement  exponent,  and 

a  23-bit  mantissa. 

A 68000 can perform operations on 16 bits of information at a time, so
operations on the 23-bit mantissa require the same time as operations on a
32-bit mantissa. Operations on the 8-bit exponent require the same time as
operations on a 16-bit exponent. Further, the EMuRSS processors have to
strip  out  unwanted  data  at  the  end  of  each  operation,  whereas  the  MPP 
processors  have  little  or  no  unwanted  data. 
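
The padding effect described above can be made concrete with a short sketch. It assumes, as a simplification, that a word-oriented processor's work scales with the number of whole 16-bit words touched and that a bit-serial PE's work scales with the exact field width. The function names are hypothetical.

```python
import math

WORD_BITS = 16  # width of a single 68000 data operation, per the text


def bits_processed_word_machine(field_bits: int, word_bits: int = WORD_BITS) -> int:
    """Bits actually handled when a field is rounded up to whole words."""
    return math.ceil(field_bits / word_bits) * word_bits


def bits_processed_bit_serial(field_bits: int) -> int:
    """A bit-serial PE handles exactly the bits in the field."""
    return field_bits


# Fields of the 32-bit floating point format listed in the text.
FIELDS = {"sign": 1, "exponent": 8, "mantissa": 23}

# Wasted bits per field on the word-oriented machine.
WASTE = {name: bits_processed_word_machine(b) - bits_processed_bit_serial(b)
         for name, b in FIELDS.items()}
```

With these assumptions the 23-bit mantissa is handled as 32 bits and the 8-bit exponent as 16 bits, which is the unwanted data that must be stripped out afterwards.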


The  specialized  floating  point  hardware  eliminates  much  of  the  overhead 
involved  with  the  handling  of  unwanted  data.  This  is  why  EMuRSS,  when 
equipped  with  the  floating  point  processor,  becomes  very  similar  in  performance 
to  MPP.  For  a  32-bit  floating  point  addition,  EMuRSS  is  9  percent  slower  than 
MPP,  but  for  a  32-bit  floating  point  multiplication,  EMuRSS  is  over  56  percent 
faster  than  MPP.  Further,  the  EMuRSS  specialized  floating  point  processor  has 
hardware  implementations  for  sine,  cosine,  and  tangent,  all  of  which  must  have 
custom  programs  written  for  their  calculations  on  MPP.  Further,  each  floating 
point  processor  is  independent  from  the  other  floating  point  processors  in 
EMuRSS,  i.e.,  they  are  not  synchronized.  Thus,  no  processor  must  be  idled  for 
any  point  in  time  during  these  calculations  to  wait  for  another  processor  to 
finish  a  calculation  whose  execution  is  data  dependent,  e.g.,  to  perform  a  cosine 
no  synchronization  is  required  during  the  intermediate  computations.  This 
makes  EMuRSS  even  more  competitive  with  MPP  because  using  the  algorithms 
in  [Har68],  there  are  conditional  instructions  that  are  required  for  the 
calculation  of  the  trigonometric  functions. 

To be able to tolerate a single fault with no degradation in response time,
MPP uses an additional 4-by-128 array of PEs. A one PU EMuRSS equipped
with  the  bypass  boxes  discussed  earlier  requires  one  additional  PU  to  be 
capable  of  withstanding  the  fault  of  a  single  PU  without  loss  of  processing 
speed.  Both  MPP  and  EMuRSS  require  some  form  of  bypass  hardware  to 
bypass  a  fault. 

MPP  and  EMuRSS  are  tolerant  to  a  single  fault.  MPP  is  not  tolerant  to 
multiple  faults,  unless  they  are  all  in  the  same  4-by-128  array  of  PEs  that  is 
bypassed.  The  number  of  usable  PUs  in  EMuRSS  is  one  more  than  N  minus 
the number of failed PUs (since an N-PU EMuRSS has one spare PU). Thus,
in general, in the event of a multiple fault, EMuRSS can continue to be used
with  a  minimal  degradation  in  performance.  For  MPP,  there  is  no  provision 
for  operation  in  a  degraded  mode  when  multiple  faults  occur. 
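
The two fault models can be contrasted in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the MPP check assumes that the bypassable 4-by-128 groups are aligned on 4-column boundaries, which the text does not state explicitly.

```python
from typing import Iterable


def emurss_usable_pus(n: int, failed: int) -> int:
    """Usable PUs in an N-PU EMuRSS with one spare: N + 1 - failed."""
    return max(n + 1 - failed, 0)


def emurss_full_speed(n: int, failed: int) -> bool:
    """True if at least N PUs remain, i.e., no loss of processing speed."""
    return emurss_usable_pus(n, failed) >= n


GROUP_WIDTH = 4  # assumed: PEs are bypassed in aligned groups of 4 columns


def mpp_can_bypass(faulty_columns: Iterable[int]) -> bool:
    """True if every faulty PE column lies in one bypassable 4-column group."""
    groups = {col // GROUP_WIDTH for col in faulty_columns}
    return len(groups) <= 1
```

For example, a 1024-PU EMuRSS with three failed PUs still has 1022 usable PUs and runs in a degraded mode, whereas faults in two different column groups leave MPP with no usable configuration under this model.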

Any  of  the  inter-processor  nearest  neighbor  communication  operations  that 
MPP  can  perform  can  also  be  handled  by  EMuRSS.  MPP  can  process  images 
by  assigning  one  pixel  to  each  PE,  or  by  dividing  the  image  to  be  processed 
into  square  neighborhoods  that  are  processed  by  the  PEs.  For  an  M-by-M 
image,  each  PE  would  hold  subimages  that  are  M/128  pixels  on  a  side.  An 
image  to  be  processed  by  EMuRSS  must  be  divided  into  stripes  extending  from 
the top of the scene to the bottom. Any inter-row communications in MPP are
internal to a PU in EMuRSS. Any inter-column communications in MPP are
either internal to a PU in EMuRSS or are between adjacent PUs using the
shared  memory.  Adjacent  processors  will  process  adjacent  stripes. 
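
The two data layouts can be sketched as index mappings. The sketch assumes that M is divisible by 128 and by the number of PUs, that stripes have equal width, and that pixels are addressed as (row, column). The function names are hypothetical.

```python
from typing import Tuple

MPP_SIDE = 128  # PEs per side of the MPP array


def mpp_pe_for_pixel(m: int, row: int, col: int) -> Tuple[int, int]:
    """PE (row, column) holding pixel (row, col) of an M-by-M image on MPP."""
    side = m // MPP_SIDE          # each subimage is M/128 pixels on a side
    return (row // side, col // side)


def emurss_pu_for_pixel(m: int, n_pus: int, col: int) -> int:
    """PU holding pixel column `col` when the image is cut into vertical stripes."""
    width = m // n_pus            # stripe width in pixel columns (assumed equal)
    return col // width


def crosses_pu_boundary(m: int, n_pus: int, col_a: int, col_b: int) -> bool:
    """True if two pixel columns live in different PUs (shared memory needed)."""
    return emurss_pu_for_pixel(m, n_pus, col_a) != emurss_pu_for_pixel(m, n_pus, col_b)
```

Because a stripe spans every row, the row index never appears in the EMuRSS mapping, which is why inter-row communication stays inside a PU.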

Both  EMuRSS  and  MPP  have  a  memory  organization  that  will  allow  an 
external  processor  to  store  information  in  one  order  and  the  processors  to  read 
the  information  in  another,  without  significant  processing.  MPP  uses  the 
staging  memory  to  perform  image  transformations  and  formatting  for  input 
and  output.  Because  of  the  way  the  EMuRSS  host  accesses  the  memory,  either 
row  or  column  format  data  can  be  loaded. 

Architecturally,  EMuRSS  and  MPP  differ  in  the  processor-to-processor 
connections.  EMuRSS  does  not  have  a  true  interconnection  network.  Instead, 
EMuRSS  implements  a  network  with  shared  memory  banks.  This  technique 
allows  memory  to  be  used  for  both  storage  and  communication,  meaning  that 
no  special  communication  protocol  is  necessary.  Data  transfer  is  treated  like  a 
memory  write. 


In  conclusion,  MPP  is  faster  than  EMuRSS  (with  the  floating  point 
hardware)  on  fixed  point  operations  and  some  floating  point  operations. 
EMuRSS  compares  reasonably  with  MPP  on  floating  point  multiplication  and 
division.  EMuRSS  has  a  hardware  unit  capable  of  performing  floating  point 
trigonometric and inverse-trigonometric functions. Because the floating point
units  are  not  run  in  lock-step,  for  any  floating  point  operation,  e.g.,  steps 
during  the  calculation  of  cosine,  EMuRSS  effectively  becomes  an  MIMD 
machine,  whereas  MPP  must  perform  these  operations  in  lock-step. 

Any  processor-to-processor  communication  that  is  required  for  an  MPP 
implementation  of  an  algorithm  can  be  handled  by  EMuRSS.  Both  MPP  and 
EMuRSS  can  handle  a  single  fault  with  no  degradation  in  performance; 
however,  only  the  fault-tolerant  EMuRSS  can  handle  multiple  faults  (with 
some  degradation). 


3.10.  Conclusions 

MuRSS,  an  SIMD  architecture  with  as  many  as  1024  processors,  was 
presented.  It  was  shown  that  N  processors  in  the  SIMD  mode  of  operation 
could  perform  various  context  independent  (e.g.,  maximum  likelihood 
classification)  and  window  based  (e.g.,  smoothing,  contextual  classification,  and 
image  correlation)  image  processing  tasks  almost  N  times  faster  than  one 
processor  of  the  same  type.  The  application  of  MuRSS  to  these  tasks  was 
discussed. 

Through the use of the EMuRSS SIMD architecture, computationally
demanding remote sensing processes can be implemented efficiently. This will
not only reduce the computation time required to perform remote sensing
tasks, but will also allow the investigation of techniques which may otherwise
be considered infeasible.

Because  of  the  architecture  of  MuRSS,  multiple  faults  seriously  degraded 
its  performance.  The  architecture  of  MuRSS  was  altered  to  increase  MuRSS’ 
tolerance  to  faults,  creating  EMuRSS.  EMuRSS  was  then  compared  to  MPP  in 
the  areas  of  performance,  fault  tolerance  and  capabilities. 


CHAPTER  4 


MODELS  FOR  USE  IN  THE  DESIGN  OF 
SPECIAL  PURPOSE  MACRO-PIPELINED 
PARALLEL  PROCESSORS 

4.1.  Introduction 

For  certain  applications,  such  as  speech  processing,  time  is  an  important 
factor. In such applications, there is a need to process many data sets in the
same way, e.g., performing an FFT for every frame of input data. Previous
analysis, such as that performed in [Dem83, TuA83, YoS82, Vic79], shows that
for many types of tasks, a general purpose processor is not sufficient. In this
chapter, an approach is proposed for modeling off-the-shelf hardware and for
modeling  parallel  algorithms,  along  with  a  design  methodology  to  use  the 
information  provided  by  these  models,  to  design  a  class  of  macro-pipelined 
special  purpose  parallel  architectures.  The  goal  is  to  use  models  such  as  the 
ones  proposed  here  to  develop  computer  aided  design  tools. 

Special purpose processing systems (such as those used for dedicated
real-time analysis) are typically sold in small quantities. As a result, the cost of the
design can make the resulting system prohibitively expensive. Computer aided
design tools for this process would reduce the cost involved and are therefore
desirable.

This chapter uses nine parameters to correlate the hardware to be
designed  to  the  applications  software  to  be  executed  and  the  I/O  environment 
in  which  the  machine  is  to  operate,  i.e.,  what  data  rates  the  machine  must 
handle,  the  format  of  the  incoming  data,  the  format  of  the  outgoing  data,  etc. 
A  macro-pipelined  layered  approach  to  task  decomposition  is  demonstrated. 
Each  portion  of  the  decomposed  task  in  a  scenario  is  then  assigned  to  a 
specifically  designed  special  purpose  processing  unit.  This  implies  that  each 
processing  unit  may  either  be  a  traditional  serial  type  design  or  a  parallel 
design.  Once  this  initial  decomposition  is  established,  techniques  such  as  those 
used  to  adjust  the  execution  time  and  throughput  of  a  pipeline  in  [HwB84]  can 
be  applied. 

In this approach to reaching the goal of automated computer design,
functional descriptions (models) of the hardware components that may be used
in the design must be combined into a database. Included in such a database
is information about the cost, size, power consumption, and heat dissipation of
the device, an enumeration of all the operations that it can execute, the
pathwidth and execution times for those operations, the number and size of the
registers, and a simulation routine for the device. More complex taxonomies,
such as those found in [Han77], [Han81], [HoJ81], and [Gil83] are not needed for
the database because they specify architectural information. Here, only
information that affects the processing speed of the unit is considered. While
the architectural information provided by more complex taxonomies can yield
similar information, handling of the additional data is cumbersome.

The information in the database will be used to select the “best” hardware
to execute a given algorithm. As suggested in [Gon78], it is desirable to
establish, and order according to importance, the criteria used to rank designs.


The  criteria  used  here  will  be  (in  order  of  importance):  speed  and  cost.  Speed 
refers  to  both  response  time  and  throughput.  The  response  time  is  the  time 
between  receiving  the  input  and  transmission  of  the  corresponding  result.  The 
throughput  is  the  number  of  data  sets  processed  per  unit  time.  Other  criteria 
might  include:  space,  power  requirements,  and  cooling  requirements. 
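
One way to apply ordered criteria is a lexicographic sort, sketched below. This is only an illustration: the `Design` record and its field names are hypothetical, and placing response time ahead of throughput within "speed" is an assumption, since the text does not order the two.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Design:
    name: str
    response_time: float   # seconds from input to the corresponding result
    throughput: float      # data sets processed per second
    cost: float            # any consistent cost unit


def rank_key(d: Design) -> Tuple[float, float, float]:
    """Speed first (lower response time, then higher throughput), cost second."""
    return (d.response_time, -d.throughput, d.cost)


def rank_designs(designs: List[Design]) -> List[Design]:
    """Return the candidate designs ordered from best to worst."""
    return sorted(designs, key=rank_key)
```

Further criteria such as space, power, or cooling would simply extend the key tuple in their order of importance.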

Using  information  about  each  sub-task  in  a  scenario,  a  specific  hardware 
organization  can  be  arranged  to  execute  the  required  algorithm  when  possible. 
Consider  a  task  that  is  composed  of  several  sub-tasks.  An  example  of  such  a 
task  might  be  isolated  word  recognition  [YoS82].  For  isolated  word  recognition, 
a  typical  processing  scenario  might  be:  digital  filtering,  autocorrelation  analysis, 
linear  predictive  coding  (LPC)  analysis,  linear  time  warping,  and  dynamic  time 
warping.  Each  of  these  processes  (sub-tasks)  represents  a  portion  of  the 
scenario.  An  example  of  the  scenario  is  in  Fig.  4.1.1.  Each  of  the  sub-tasks 
will  be  called  a  layer.  Using  information  about  each  sub-task,  a  special- 
purpose  architecture  can  be  developed  to  execute  the  sub-task  within  some 
time  and  cost  constraints.  The  special-purpose  hardware  that  is  assigned  to 
each  layer  will  be  called  a  level. 

For  the  present,  only  a  simple  scenario  (one  in  which  there  is  no  feedback) 
is  considered.  Initially,  the  sub-tasks  will  be  chosen  according  to  conceptual 
differences,  i.e.,  digital  filtering  is  different  from  autocorrelation  analysis,  so 
each  should  be  a  different  layer.  It  is  assumed  that  in  general,  conceptually 
different  portions  of  the  task,  i.e.,  the  sub-tasks,  require  different  hardware 
resources.  A  more  complete  discussion  of  the  application  of  such  a  design  to  an 
isolated  word  recognition  system  may  be  found  in  Section  4.6. 

It  is  the  goal  of  this  scheme  to  achieve  a  higher  throughput  by 
decomposing a scenario into layers. Because each layer requires fewer
computations than the entire scenario, connecting the levels in a macro-pipeline
and  pipelining  the  data  sets  through  the  machine  should  increase  the 
throughput  of  the  resulting  system.  This  type  of  parallelism  is  referred  to  as 
vertical  parallelism.  Since  each  layer  is  executing  on  specially  designed 
hardware,  which  may  consist  of  multiple  computational  units,  the  response 
time  of  the  resulting  system  is  decreased.  The  parallelism  occurring  within  a 
given  level,  where  multiple  units  are  performing  operations  on  different  portions 
of  the  data  set  simultaneously,  is  referred  to  as  horizontal  parallelism. 
Vertical  and  horizontal  parallelism  are  similar  to  the  techniques  of  subdivision 
and  replication  discussed  for  pipelines  in  [HwB84]  or  the  “purely  pipelined”  and 
the “purely parallel” architectures discussed in [WoC84]. Throughput
constraints  may  require  that  a  layer  be  further  divided  into  smaller  processes. 
These  will  not  represent  new  layers,  but  sub-layers,  which  will  correspond  to 
sub-levels  of  hardware,  consistent  with  the  previous  nomenclature. 
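
The separate effects of vertical and horizontal parallelism can be illustrated with a minimal timing sketch. It rests on simplifying assumptions that are not taken from the text: transfer delays between levels are ignored, a layer's work divides evenly when it is subdivided or replicated, and the layer times used in any example are hypothetical.

```python
from typing import List, Sequence


def response_time(layer_times: Sequence[float]) -> float:
    """Time for one data set to pass through every level in turn."""
    return sum(layer_times)


def throughput(layer_times: Sequence[float]) -> float:
    """Data sets per unit time, limited by the slowest level of the pipeline."""
    return 1.0 / max(layer_times)


def replicate(layer_times: Sequence[float], index: int, units: int) -> List[float]:
    """Horizontal parallelism: `units` units share one layer's work evenly."""
    out = list(layer_times)
    out[index] = out[index] / units
    return out


def subdivide(layer_times: Sequence[float], index: int, parts: int) -> List[float]:
    """Vertical parallelism: one layer becomes `parts` equal sub-layers."""
    t = layer_times[index] / parts
    return list(layer_times[:index]) + [t] * parts + list(layer_times[index + 1:])
```

With hypothetical layer times of 4, 10, and 6 time units, subdividing the middle layer raises the throughput from 1/10 to 1/6 while the response time stays at 20, and replicating it instead lowers the response time to 15.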

It  is  possible  to  sub-divide  the  layers  to  the  point  where  each  sub-level 
performs  exactly  one  instruction.  The  result  would  be  a  special  purpose, 
dedicated,  instruction-level,  data  flow  machine,  capable  of  performing  only  a 
single  task.  A  minor  alteration  in  the  program  would  require  an  alteration  in 
the hardware. For all but the least complex scenarios, the hardware cost
would be overwhelming. Analogously, layers can be combined to the point
where  one  level  performs  an  entire  task.  This  is  the  case  with  a  traditional 
serial  machine.  Presumably,  the  throughput  of  such  a  machine  would  be  too 
small. 

By developing a method to transform a task description into a potential
macro-pipelined architecture, a machine can be built with the necessary
characteristics to execute the task quickly and without excessive amounts of
hardware. A basis for such a method is examined in Section 4.5. A similar goal
can be found in [WoC84], where the goal is that of an automated tool for
planning and integrating signal processing systems in a distributed computing
environment. [WoC84] examines the performance of a system to satisfy
requirements  for  throughput  and  robustness  with  respect  to  hardware 
allocation  strategies,  i.e.,  how  can  processors  be  added  or  deleted  from  a  system 
to  optimize  performance.  A  valuable  result  from  the  work  in  [WoC84]  is  the 
detailed  analysis  of  the  resulting  system.  These  techniques  can  also  be  applied 
to load balancing between processors. The types of systems that are considered
in [WoC84] are either purely parallel (SIMD or MIMD) [Fly66], or purely
pipelined. A purely parallel system corresponds to the parallelism within a level
(horizontal parallelism), while a purely pipelined system corresponds to the
level to level and sub-level to sub-level relationships (vertical parallelism).
Thus,  this  research  is  a  useful  tool  in  the  analysis  of  both  the  macro  (level  to 
level)  and  the  micro  (within  a  level)  characteristics  of  the  system.  Here,  the 
major  concern  is  the  underlying  concepts  behind  a  model  relating  specific 
algorithms  to  the  requirements  they  place  on  hardware.  The  research  here 
expands  on  the  work  in  [WoC84]  by  allowing  both  forms  of  parallelism  at  any 
level. 

The analysis categories in [WoC84] can be applied to any given level that
contains  one  or  more  combinations  of  these  parallel  types.  This  will  allow  each 
level  to  be  designed  for  a  specific  sub-task,  having  a  special  hardware 
complement  to  more  quickly  execute  that  sub-task,  resulting  in  a  machine  that 
can  complete  a  processing  scenario  within  some  time  constraint.  For  the  case 
to  be  discussed  in  Sections  4.6  and  4.7,  the  time  constraint  will  be  that  the 
proposed  system  must  understand  isolated  words  in  real-time. 


It  is  the  goal  of  this  chapter  to  introduce  methods  of  modeling  hardware 
and  algorithms  so  that  an  accurate  estimation  of  the  execution  time  of  an 
algorithm  is  possible.  The  proposed  hardware  database  is  discussed  in  Section 
4.2.  Response  time  and  its  relation  to  the  system  hardware  is  considered  in 
Section 4.3. Section 4.4 will discuss the two types of parallelism and their
effect on the overall performance of the system. Section 4.5 will present nine
parameters  and  discuss  their  relationship  to  the  hardware  of  the  corresponding 
level.  In  addition,  the  parameters  are  related  to  the  application  software  of  the 
corresponding  layer.  By  applying  both  of  these  relationships,  the  software  can 
be  related  to  the  hardware.  This  is  done  in  Sections  4.6,  4.7,  and  4.8,  where 
the  concepts  discussed  in  Sections  4.2  through  4.5  are  applied  to  an  isolated 
word  recognition  system. 

4.2.  The  Hardware  Database 

A processor description in the database consists of a 9-tuple, a 6-tuple,
and a set of three N-tuples and three (N+1)-tuples, where N is the number of
assembly language instructions (the “+1” includes the instruction fetch unit,
which can, on some systems, overlap execution with certain instructions). The
9-tuple  consists  of  the  processor  name,  cost,  package  size,  thermal  dissipation 
requirements,  power  requirements,  clock  speed,  data  pathwidth,  address 
pathwidth,  and  virtual  address  space.  The  package  size,  thermal  dissipation, 
and  power  requirements,  are  included  for  applications,  such  as  those  aboard  a 
satellite,  where  information  about  all  three  categories  may  be  crucial.  For 
some  processors,  such  as  the  PDP-11/70,  the  virtual  address  space  and  the  real 
address  space  differ,  so  both  are  required  for  specification  of  the  processor. 


The 6-tuple consists of the size and speed of on-board cache, the size and
speed of on-board memory, and the number and size of the registers. The N- and
(N+1)-tuples must provide information about: the type of machine
instructions, the execution time for a single operation for each instruction, the
number of stages in any pipelines, the replication of units, and the overlap of
operations. The tuples corresponding to the last three information categories
are (N+1)-tuples to account for any pipelining, functional overlap, and
parallelism that can occur within the instruction fetch unit. By combining the
information contained in the various tuples, it is possible to derive a precise
estimation of the execution time of all operations whose times are constant.
(Some operations, e.g., floating point operations on units like the AMD9511A,
require variable amounts of time to execute the same operation on different
arguments; thus only an estimation or expected processing time may be
derivable.) By combining information in different tuples, much information can
be gained. As a simple example, by combining the number of stages in a pipelined unit
with the single operation execution time of the unit, it is possible to determine
the throughput of the unit.

Because  different  processors  have  different  instruction  sets,  it  is  logical  that 
N  not  be  the  same  for  all  processors.  Consider  the  case  of  a  simple  processor 
with  an  instruction  set  consisting  of  an  8-bit  add,  a  16-bit  add,  a  return  on 
zero,  a  move  memory  to  register  (8-bit),  and  a  move  register  to  memory  (8-bit). 
The  9-tuple  would  look  like: 

(BRAND/MODEL, $5.00, 1.5cm-by-3.0cm, 1.5-BTU/hr,
0.15-W, 1.3-µsec, 8-bits, 16-bits, 16-bits)

Each  element  of  the  above  tuple  corresponds  to  the  above  enumeration  of  the 
elements of the 9-tuple. For a simple processor, like the 8085, the 6-tuple
would be:

(0,0,0,0,(1 8-bit, 3 16-bit(8-bit)))


The  four  “0’s”  show  that  there  is  no  on-board  cache  or  memory,  while  the 
parenthesized  quantity  associated  with  the  16-bit  register  width  shows  that  the 
three  16-bit  registers  can  be  addressed  in  8-bit  units.  For  a  processor  that  is 
capable  of  performing  the  above  instructions  (which  are  a  small  subset  of  the 
instruction set of the 8085), the 5-tuple describing the capabilities would look
like:


(8-bit  add  register  to  register, 
16-bit  add  register  to  register, 
return  if  zero, 

8-bit  move  memory  to  register, 
8-bit  move  register  to  memory) 


Both the source and destination of each operation must be enumerated. This
allows for processors (like the 8085) in which the results of a given operation
must go to a specific place (the accumulator). The information in the ith
element in each of the following tuples corresponds to the ith element of the
tuple enumerating the instruction set of the processor. There should be some
closed form of notation for this section; for example, iaddXX could be used to
represent an integer addition that is XX bits wide. Such a notation would allow
the same assembly code to be used on various machines supporting similar
operations; this would replace the requirement of knowing the assembly
language for each unit in the database with knowing one generic assembly
language.

The 5-tuple describing single operation execution time of the operations is
as follows (all times are in processor cycles):

(5, 10,(5/11),  7, 7) 

This  5-tuple  describes  the  information  about  the  time  to  execute  each  of  the 
above  operations.  By  describing  all  the  operations  of  the  processor  in  basic 
clock  cycles,  the  description  of  improved  versions  of  a  processor  can  be  easily 
added  to  the  database.  For  this  example,  the  “return  if  zero”  (third  element 
above)  command  is  associated  with  two  times.  This  corresponds  to  the 
execution  time  of  the  conditional  if  it  is  false/true.  The  information  in  this 
tuple,  combined  with  the  number  of  times  each  specific  assembly  language 
instruction  is  executed,  provides  a  worst  case  timing  analysis  for  a  given 
processor. 
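
The worst case estimate described above is a weighted sum, sketched here using the cycle counts from the 5-tuple. The instruction mnemonics and the instruction counts in any example are hypothetical; only the cycle values come from the text.

```python
from typing import Dict

# Hypothetical mnemonics for the five example instructions, in tuple order.
INSTRUCTIONS = ("add8", "add16", "rz", "load8", "store8")

# Cycle counts from the text; "return if zero" carries (false, true) times.
CYCLES = (5, 10, (5, 11), 7, 7)


def worst_case_cycles(counts: Dict[str, int]) -> int:
    """Sum of (times executed) x (longest execution time) over all instructions."""
    total = 0
    for name, cycles in zip(INSTRUCTIONS, CYCLES):
        # A conditional contributes its slower outcome in the worst case.
        worst = max(cycles) if isinstance(cycles, tuple) else cycles
        total += counts.get(name, 0) * worst
    return total
```

Multiplying the result by the processor's cycle time from the 9-tuple converts the estimate from cycles to seconds.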

The  next  5-tuple  contains  the  number  of  fetches  required  to  execute  each 
operation.  This  is  needed  to  help  describe  systems  where  the  instruction  fetch 
can  be  overlapped  with  the  actual  execution  of  an  instruction.  For  this 
particular  processor,  this  tuple  would  look  like: 

(1,1,1,1,1)

To  account  for  a  unit  with  internal  pipelining,  the  third  tuple  will  contain 
the  number  of  stages  in  the  pipeline  for  each  operation  the  processor  can 
perform.  When  a  specific  command  is  not  pipelined,  the  number  of  stages  in 
the  pipeline  is  1.  (The  following  tuples  must  also  take  the  unit  performing  the 
instruction  fetch  into  consideration.)  Thus,  if  the  8-  and  16-bit  addition  units 
were  5-stage  pipelines  and  the  rest  of  the  unit  was  not  pipelined,  the  6-tuple 
describing the pipelining would look like:

(5,5,1,1,1,1)

It is possible for a processor to have two processing units that execute the
same  operations  simultaneously,  like  a  micro-processor  equipped  with  two 
adders.  If  for  the  previous  example,  there  were  four  adders,  two  for  8-bit 
operations  and  two  for  16-bit  operations,  the  next  6-tuple  would  look  like: 

(2,2, 1,1, 1,1) 

Finally, functional overlap between operations must be considered. This is
done in the final tuple that would look like:

{ memory-register/16-bit add/fetch,
memory-register/8-bit add/fetch,
none,
8-bit add/16-bit add,
8-bit add/16-bit add,
8-bit add/16-bit add }

This  6-tuple  shows  that  the  8-bit  add  can  be  overlapped  with  both  memory- 
register  operations  and  the  16-bit  add.  The  16-bit  add  can  be  overlapped  with 
the  memory-register  operations  and  the  8-bit  add.  For  this  example,  the  return 
if  zero  command  cannot  be  overlapped  with  any  operations,  the  memory 
register  operations  can  be  overlapped  with  both  addition  instructions,  and  the 
instruction  fetch  can  be  overlapped  with  the  arithmetic  operations. 
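
Taken together, the tuples of this section form one database record per processor. The sketch below shows one possible in-memory layout, populated with the running example in its pipelined, replicated variant. The class name, field names, and instruction mnemonics are hypothetical, and the register description is kept as a single nested item, as in the example 6-tuple.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProcessorRecord:
    general: Tuple                   # 9-tuple: name, cost, package size, ...
    storage: Tuple                   # cache, memory, and register description
    instructions: Tuple[str, ...]    # N-tuple of operations
    exec_cycles: Tuple               # N-tuple of single operation times
    fetches: Tuple[int, ...]         # N-tuple of fetches per operation
    stages: Tuple[int, ...]          # (N+1)-tuple, last entry is the fetch unit
    replication: Tuple[int, ...]     # (N+1)-tuple of unit counts
    overlap: Tuple[frozenset, ...]   # (N+1)-tuple of overlappable operations

    def check(self) -> bool:
        """Verify that every tuple has the length the model requires."""
        n = len(self.instructions)
        return (len(self.general) == 9
                and len(self.exec_cycles) == n
                and len(self.fetches) == n
                and len(self.stages) == n + 1
                and len(self.replication) == n + 1
                and len(self.overlap) == n + 1)


ADDS = frozenset({"add8", "add16"})

EXAMPLE = ProcessorRecord(
    general=("BRAND/MODEL", "$5.00", "1.5cm-by-3.0cm", "1.5 BTU/hr",
             "0.15 W", "1.3 usec", "8 bits", "16 bits", "16 bits"),
    storage=(0, 0, 0, 0, ("1 8-bit", "3 16-bit (8-bit)")),
    instructions=("add8", "add16", "rz", "load8", "store8"),
    exec_cycles=(5, 10, (5, 11), 7, 7),
    fetches=(1, 1, 1, 1, 1),
    stages=(5, 5, 1, 1, 1, 1),
    replication=(2, 2, 1, 1, 1, 1),
    overlap=(frozenset({"mem-reg", "add16", "fetch"}),
             frozenset({"mem-reg", "add8", "fetch"}),
             frozenset(),            # return if zero overlaps with nothing
             ADDS, ADDS, ADDS),      # both moves and the fetch unit
)
```

Keeping the instruction-indexed tuples the same length makes it straightforward to look up every property of the ith instruction with one index, which is the correspondence the text relies on.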

These  last  four  tuples  are  used  to  obtain  tighter  limits  on  the  execution 
time  a  given  processor  will  require  to  execute  a  given  algorithm.  For  example, 
if  the  instruction  fetch  cannot  be  overlapped  with  the  execution  of  any 
instruction, the previously discussed maximum execution time is a
good  approximation  of  the  actual  execution  time.  By  not  allowing  the  fetch  to 
overlap  with  any  instruction,  each  instruction  must  reach  completion  before 
fetching  the  next  instruction.  This  eliminates  any  possible  functional  overlap. 


If the instruction fetch can be overlapped with the execution of given
operations, then whenever the execution time for those operations exceeds the
time for the fetch of the next operation, the fetch time for the subsequent
operation can be deducted from the maximum execution time, yielding $T_f$.
Consider the following example (the bold line represents execution time, the
narrow line represents fetch time).


Since  the  execution  time  of  the  first  operation  is  overlapped  with  the  fetch  time 
of  the  second,  the  second  operation  can  begin  at  the  termination  of  the  first 
operation,  effectively  eliminating  the  fetch  time. 
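
The fetch deduction can be sketched as a pass over an instruction sequence. This is a minimal sketch, assuming that each instruction has a fixed fetch time and execution time in cycles and a flag saying whether the next fetch may overlap its execution. The names `Op`, `t_max`, and `t_f` are hypothetical, as are any cycle values used with them.

```python
from typing import List, NamedTuple


class Op(NamedTuple):
    fetch: int               # cycles to fetch this instruction
    execute: int             # cycles to execute it
    hides_next_fetch: bool   # True if the next fetch may overlap this execution


def t_max(ops: List[Op]) -> int:
    """No overlap: every instruction is fetched, then executed, in sequence."""
    return sum(op.fetch + op.execute for op in ops)


def t_f(ops: List[Op]) -> int:
    """Deduct each following fetch that is fully hidden behind an execution."""
    total = t_max(ops)
    for cur, nxt in zip(ops, ops[1:]):
        # The deduction applies only when execution exceeds the next fetch.
        if cur.hides_next_fetch and cur.execute > nxt.fetch:
            total -= nxt.fetch
    return total
```

An instruction whose execution is shorter than the following fetch contributes no deduction here; that partial overlap is a separate case.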

Whenever the execution time of a given operation exceeds the fetch and
execution time of following overlappable operations, the execution time of the
subsequent operations may be deducted from $T_f$ to yield $T_{oe}'$.

The fetch overlap was taken into account in the calculation of $T_f$.

If a unit is not busy, i.e., can accept input, and its execution can be
overlapped with any currently executing instructions, it is possible to overlap
the instruction with the presently executing instructions. An overlappable
operation is any operation that can be overlapped with the execution of any
pending operations and that does not use any operand that is not complete
when its instruction execution begins. It should be noted that multiple units,
such as adders and boolean logic units, will not decrease the execution time of
an operation. Replication of hardware units will mean that there is a larger
pool of units available, i.e., the likelihood of a busy unit is decreased, so the
likelihood that there is a unit available for a given operation is increased. A
similar effect is noted for pipelined units, where if $T_{so}$ is the time that the
pipeline requires for a single operation and there are $S$ stages in the pipeline,
the pipeline is available to accept an input in time $T_{so}/S$. Greater depth in the
analysis of the timing of parallel and pipelined processing units can be found
in [Che80].

If the execution time of an operation is exceeded by the fetch and
execution time of the subsequent overlappable operation, the execution time of
the first operation minus the fetch time of the subsequent operation is
subtracted from Toe' to yield Toe. This is demonstrated as follows (again the
thin line represents the instruction fetch and the bold line represents the
instruction execution time).


The above list of parameters used to describe processing hardware is by no
means exhaustive, but it does serve as an example of the type of
information that must be stored about processing hardware in the hardware
database. It may, for example, be necessary to add an N-tuple describing two
or more units that share a common pipeline.

The first two 5-tuples can be related to the notation in [HoJ81] by:

E  =  '  B(2  +  8  ,  2  +  $  ,  load87 ,  storeg)/B(compare  and  jump  )5/n 

where E is an execution unit and B is a boolean unit. For example, the
notation $2\,+^{5}_{8}$ shows that there are two 8-bit addition units that take
5 cycles to produce a single result. The third tuple shows the relative
construction of the units and would be specified in [HoJ81] notation as follows:

$$+_{8} = +1^{1}_{8} - +2^{1}_{8} - +3^{1}_{8} - +4^{1}_{8} - +5^{1}_{8}$$

$$+_{16} = +1^{1}_{16} - +2^{1}_{16} - +3^{1}_{16} - +4^{1}_{16} - +5^{1}_{16}$$

The  +1  +2  +3  +4  +5  represent  the  various  stages  of  the  pipeline,  while 
the  superscripts  and  subscripts  are  used  to  describe  the  execution  time  and 
pathwidth  of  the  units.  Finally,  the  -’s  show  that  the  units  are  connected  in 
series,  showing  that  there  are  five  stages  in  the  pipelined  adders,  each  stage 
taking  one  cycle.  A  representation  of  the  other  functions  is  not  necessary 
because  the  third  tuple  shows  that  these  units  are  not  pipelined.  By  including 
the  tuples  in  the  database,  it  is  possible  to  completely  re-create  the  desired 
timing  information  stored  in  the  architecture  description  notation  set  forth  in 
[HoJ81].

In addition to the functional description, each device may be classified
according to its use. The categories useful for this database (as stated in
[ArB82]) are: processor, memory, input/output units, vector processors, and
array  processors.  Application  of  this  classification  scheme  will  allow  different 
sets  of  parameters  to  be  used  to  classify  devices  in  different  categories.  In 
addition,  these  categories  give  some  idea  of  the  processing  different  units  can 
perform,  although  some  units  may  be  capable  of  performing  various  tasks.  For 
example,  to  input  and  store  data,  unless  preprocessing  is  needed,  an  input  unit 
can  perform  the  same  function  as  a  processor,  i.e.,  the  input  unit  can  be  used 
to  access  a  sensor  and  store  the  sampled  data  in  memory  without  interrupting 
the  processor.  The  hardware  can  be  grouped  by  category  in  the  database, 
decreasing  the  required  search  time. 

For the purposes of this paper, the units considered for the database are
either single chips or small boards. The underlying assumption for this scheme
is that there are no shared or reconfigurable pipeline units on board. When this
assumption becomes false, two (N+1)-tuples will be required to represent shared
pipelines and their reconfiguration times.

A functional description such as that found in [ArB82] can be used to
accurately categorize each unit according to its functional capabilities. To this
point, only processing hardware has been considered. The hardware database
can be divided into the functional units of processor, memory, input/output,
vector, and array processors. This is consistent with [ArB82]. Each of these
functional categories will have a set of tuples used to describe its performance.
The tuples will be used with the characteristics of the application algorithm to
choose specific hardware for each level of the system.

Included with the hardware descriptions of the processors in the database
would  be  a  routine  that  can  simulate  the  performance  of  the  processor.  By 
combining  the  simulation  procedures  with  the  architectural  information  of 
other  components  in  the  database,  e.g.,  memories,  it  is  possible  to  create  a 
simulator  for  the  proposed  macro-pipelined  architecture.  Such  a  database  with 
simulation  routines  for  each  relevant  component  would  be  a  useful  tool  for  the 
research  community  interested  in  the  design  of  macro-pipelined  special  purpose 
systems. 

4.3.  Response  Time  —  Its  Meanings  and  Interpretations 

The desired response time can be interpreted in various ways depending on
the application. For certain applications, the response time may be a function
of input. This is discussed in detail in Section 4.5. In such cases, the desired
response time can be considered to be an average response time Trav or an
absolute maximum acceptable response time Trmax. Let Trdes be the desired
response time and TR be the actual response time. If Trdes = Trmax, then it
is required that TR ≤ Trdes. This results in a system that will always respond
as fast as or faster than the desired response time and is useful where response
time is crucial. Such a system may respond faster than is needed; thus the
hardware will not be fully utilized when TR < Trdes.

Trdes can instead be interpreted to mean Trav. Let D be the number of input
data sets to be processed and let TRi be the actual system response time for
the ith data set. If Trdes = Trav, then

$$T_{rdes} = \frac{1}{D} \sum_{i=1}^{D} T_{R_i}$$

The average response time is Trdes, but it is possible for TRi > Trdes on multiple
consecutive occasions. If the processing times of various data sets are unrelated
(independent) and each is equally likely to fall above or below Trdes, the
probability that TRi > Trdes on M consecutive occasions is 0.5^M. In a real-time
environment, if the system falls behind the incoming data, there are two cases
that can arise. Either there will not be enough buffering and data will be lost,
or there will be enough buffering and results will be delayed. In certain
real-time applications, such as air traffic control, neither of these
alternatives is desirable, so Trmax should be used instead of Trav.

In  addition,  it  may  be  necessary  to  specify  both  Trav  and  Trmax.  This 
corresponds  to  the  case  where  an  average  response  time  is  desired  and  where  an 
absolute  ceiling  on  the  response  time  is  needed.  For  the  purposes  of  this  paper, 
it  will  be  assumed  that  Trdes  =  Trav. 
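The two interpretations of Trdes can be compared with a short sketch. The function names and the sample response times below are hypothetical, and the default late-response probability of 0.5 is the assumption used in the text for independent data sets; the average test is written as "mean no greater than Trdes" so that it can serve as a design check.

```python
def meets_max(times, t_rdes):
    """Trdes = Trmax: every individual response must be within the limit."""
    return all(t <= t_rdes for t in times)

def meets_avg(times, t_rdes):
    """Trdes = Trav: only the mean over the D data sets is constrained."""
    return sum(times) / len(times) <= t_rdes

def prob_consecutive_late(m, p_late=0.5):
    """Probability of m late responses in a row for independent data sets."""
    return p_late ** m

times = [8.0, 12.0, 9.0, 11.0]   # hypothetical TRi values, D = 4
print(meets_avg(times, 10.0))    # True
print(meets_max(times, 10.0))    # False
print(prob_consecutive_late(3))  # 0.125
```

The same set of response times satisfies the average interpretation while violating the maximum interpretation, which is why the choice between Trav and Trmax matters for real-time applications.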

4.4.  Parallelism,  Task  Division,  and  Design  Scenario 

It is reasonable to ask: “if given a description of a task, can a computer be
designed to execute it?” The answer depends on the algorithm chosen, since
various algorithms that perform a given task require varying types of
calculations, memory space, interconnection networks, and execution times. For
a simple example of some of the above variations, consider an in-place sort
(bubble) [HoS82b], a sort that requires extra memory (bin) [AhH76], and a
parallel sorting algorithm [Pre77]. Assume the sorts are performed on a list
containing Ne elements, with the largest element L digits long. The bubble sort
will require $O(N_e^2)$ time with no extra memory. The bin sort will require
$O(N_e L)$ time in 2Ne memory. The fast parallel sorting algorithm proposed in
[Pre77] will require $O(\log_2 N_e)$ time using
$\lceil N_e \log_2(N_e + 1) \rceil / 2$ processors. Since the time, memory
space, and optimal arrangement of hardware are functions of the algorithm, not
the task, it is best to extract the needed features from the algorithm and
design an architecture to fit a specific set of algorithms. Thus, to design a
system for a given task, it may be necessary to evaluate and compare the use of
several different algorithms and their associated hardware requirements.
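As a rough illustration of how the requirements differ by algorithm rather than by task, the sketch below tabulates the growth orders quoted above for a given list size. It is only an order-of-magnitude comparison: constant factors are dropped, the function name is hypothetical, the memory of the parallel sort is left unspecified because the text does not give it, and its processor count is taken to be on the order of Ne log2 Ne (an assumption).

```python
import math

def sort_requirements(n_e, digits):
    """Return rough time/memory/processor orders for the three sorts.

    Constant factors are dropped; the parallel processor count uses
    n_e * log2(n_e) as an order of magnitude (an assumption).
    """
    return {
        "bubble":   {"time": n_e ** 2,       "memory": n_e,     "processors": 1},
        "bin":      {"time": n_e * digits,   "memory": 2 * n_e, "processors": 1},
        "parallel": {"time": math.log2(n_e), "memory": None,
                     "processors": n_e * math.log2(n_e)},
    }

# Hypothetical workload: 1024 elements, each at most 4 digits long.
for name, req in sort_requirements(1024, 4).items():
    print(name, req)
```

For this workload the bin sort trades twice the memory for a time order several hundred times smaller than the bubble sort, while the parallel sort trades thousands of processors for a logarithmic time order.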

After  the  initial  layering  is  performed,  an  exact  statement  of  the 
application  algorithm  to  be  performed  at  each  level  will  be  used  with  the 
hardware  description  N-tuples  to  evaluate  the  performance  of  each  processor  in 
the  hardware  database.  Then  information  about  the  desired  throughput  and 
average  response  time  (Trdes)  of  the  system  must  be  gathered.  These  will  be 
the  evaluation  criteria,  i.e.,  can  a  proposed  system  process  the  data  with  the 
desired  throughput  and  response  time. 

It is possible that, for each level, the exact algorithm may be available
only as a selection of various algorithms, i.e., there may be more than one
candidate algorithm for a level to execute. The speed at which a given level
operates is a function of the algorithm, so the speed of the corresponding level
will be a function of the final algorithm chosen for that level. Since the
algorithms determine the layering of a task, not only the architecture of a
single level, but the entire system architecture depends on the selection. It is
also possible that one selection may require another, e.g., a frequency domain
process may require that the data be converted from time domain to frequency
domain. Thus, the entire system may be a list of possible alternatives. Since
this requires a precise computational model for each level, each possible
alternative architecture for the system must be explored.

The  first  step  in  the  modeling  process  is  to  choose  all  levels  to  process 
their  incoming  data  as  fast  as  possible  without  using  vertical  or  horizontal 
parallelism  within  any  given  level.  The  resulting  design  is  a  macro-pipelined 
system. 

Architectures that are designed along these lines “pipeline” data through
the levels, producing a continuous flow of data. The time to process a single
data set (the time for data to go from the first level to the last level, i.e., the
response time) in such a vertical architecture is not decreased by the
parallelism of the multiple levels of hardware. The throughput for multiple
data sets is greatly increased because new results are completed at intervals
equal to the processing time of the slowest level or sub-level. This is a
considerable improvement over a traditional serial design. If the time to go
from the first level to the last level is too long, horizontal parallelism, such
as that found in SIMD or MIMD machines, must be applied. The design resulting
from the application of the techniques outlined in this scenario will be neither
purely parallel nor purely pipelined, but will be a hybrid combination of both
forms of parallelism.

If  the  processing  time  for  all  levels  and  sub-levels  of  an  architecture  were 
halved,  the  time  to  go  from  the  first  level  to  the  last  level  would  also  be  halved. 
Thus,  vertical  parallelism  can  be  applied  to  increase  throughput,  while 
horizontal  parallelism  can  be  applied  to  increase  throughput  and  decrease 
response  time. 
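The distinction between the two forms of parallelism can be made concrete with a small sketch. It assumes an idealized macro-pipeline in which transfer overhead is ignored, the response time is the sum of the level times, and a new result emerges once per slowest-level time; the function names, the level times, and the ideal speedup of two are all illustrative assumptions.

```python
def response_time(level_times):
    """Time for one data set to pass from the first level to the last."""
    return sum(level_times)

def result_interval(level_times):
    """Time between successive results: set by the slowest level."""
    return max(level_times)

base = [4.0, 10.0, 6.0]             # hypothetical level times
# Vertical: split the slowest level into two pipelined sub-levels.
vertical = [4.0, 5.0, 5.0, 6.0]
# Horizontal: two processors at the slowest level, assuming ideal speedup.
horizontal = [4.0, 5.0, 6.0]

for name, lv in [("base", base), ("vertical", vertical), ("horizontal", horizontal)]:
    print(name, response_time(lv), result_interval(lv))
# base 20.0 10.0
# vertical 20.0 6.0
# horizontal 15.0 6.0
```

Both changes shorten the interval between results, but only the horizontal change shortens the response time.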


Horizontal  parallelism,  however,  is  not  the  cure  for  all  slow  tasks.  The 
limitation  on  horizontal  parallelism  is  the  inherent  parallelism  of  the  subtask  to 
be  performed.  Further,  horizontal  parallelism  is  affected  by  precedence 
constraints  of  the  subtask.  Vertical  parallelism  is  not  affected  by  precedence 
constraints  because  they  are  still  enforced;  however,  vertical  parallelism  will  not 
reduce the response time. Thus, there are costs and limitations associated
with both vertical and horizontal parallelism.

The  design  of  a  machine  suited  to  a  special  task  can  be  considered  to  be  a 
two  step  process. 

1)  Create  sub-tasks  based  on  conceptual  differences  (vertical) 

2)  Break  down  sub-tasks  based  on  time  requirements 
(horizontal  and  vertical) 

Step  1  creates  the  initial  levels  of  the  hardware.  Since  the  execution  time  may 
vary  extensively  from  level  to  level,  and  the  levels  are  pipelined,  the  execution 
time  of  each  level  should  be  balanced  to  allow  maximum  utilization  of  all 
hardware  present.  That  is,  the  overall  processing  speed  of  the  macro-pipeline 
will  depend  on  the  speed  of  the  slowest  level.  Step  2  would  be  employed  to 
increase  the  throughput  and/or  reduce  the  response  time  of  slower  levels  to 
help  balance  the  execution  times. 

The next portion of the design will require an interface between the
sub-task and the hardware. Included in this interface is the description of the
sub-task in terms that relate it to the requirements that it places on the
hardware. This description is used to design candidate architectures, whose
performance is evaluated by some measure [SiS82]. It is this portion of the
design that is the topic of discussion in the next two sections.

The task division here is similar to that used for Piecewise Data Flow
Architectures (PDF) [ReM83] in that a task is divided into basic blocks called
sub-tasks. However, instead of scheduling the sub-tasks for execution on a
unit, a special unit is designed for each sub-task. The resulting design is more
limited in scope than the PDF, but will be better suited for a specific task. By
designing each unit with commercially available parts, the overall architecture
can be implemented with current technology, like the PDF; however, by
designing a special purpose unit, hardware unneeded for the specific task under
consideration can be eliminated from the design, reducing the cost of the end
result.

In  addition,  the  design  resulting  from  the  PDF  is  composed  of  several 
small  units,  operating  in  a  data  flow  environment,  combined  into  a  larger  more 
powerful  “single”  unit.  A  given  functional  unit  in  a  PDF  may  be  used  several 
times,  by  various  processes  in  the  scenario,  while  the  proposed  levels  will 
execute  layers  on  each  data  set  once.  Further,  the  PDF  is  composed  of  simple 
atomic  units,  while  a  level  in  the  proposed  design  is  a  combination  of 
traditional  serial,  SIMD,  MIMD,  and  pipelined  designs.  This  allows  the  level 
associated  with  each  sub-task  to  be  designed  to  achieve  the  desired 
combination  of  throughput  and  response  time. 

The research in both [ReM83] and [WoC84] relates to work done in [Vic79],
in which a distributed computing system is analyzed with respect to
implementation of switching, bussing, and interconnection; partitioning criteria;
and testing methodology. In [Vic79], a dynamically reconfigurable system
similar to a PDF is considered. There, graph theory is used to go from the
algorithm to the actual hardware. The pertinence of that research to the work
under consideration here is that the analysis of what amounts to a distributed
machine has been considered with respect to robustness and throughput.


There  are  two  limitations  on  the  type  and  amount  of  parallelism  applied 
at  each  level.  The  first  is  that  there  must  be  an  upper  bound  on  the  cost.  An 
additional  limitation  is  placed  on  the  type  and  amount  of  parallelism  by 
requiring  that  all  parts  be  “off  the  shelf.”  This  second  limitation  forces  the 
architecture  to  be  buildable  with  present  day  technology.  These  limitations 
assume  that  an  algorithm  can  be  structured  for  parallel  execution.  If  an 
algorithm  is  unsuitable  for  parallel  execution  then  vertical  parallelism  is 
required. 

The  minimum  horizontal  parallelism  at  any  level  is  a  single  unit,  while  the 
maximum  horizontal  parallelism  is  limited  by  the  inherent  parallelism  of  the 
sub-task  and  cost  of  the  units.  Typically,  each  additional  processor  used  for 
horizontal  parallelism  may  not  increase  the  execution  speed  by  exactly  the 
same  amount,  i.e.,  the  speedup  may  not  be  Np  using  Np  processors  for  any  Np. 
This  is  discussed  in  [Sto73].  As  mentioned  earlier,  the  minimum  vertical 
parallelism  is  one  processor  and  the  maximum  vertical  parallelism  is  one 
processor  per  instruction. 

To  propose  and  evaluate  candidate  architectures  for  levels,  a  mapping  is 
required  between  a  layer  and  its  corresponding  level.  Included  in  this  mapping 
is  the  description  of  the  layer  in  terms  that  relate  it  to  the  computational 
requirements  that  it  places  on  the  hardware.  It  is  this  mapping  that  is  the  topic 
of  discussion  in  the  next  section.  Using  information  from  the  hardware 
database  discussed  in  Section  4.2,  the  performance  of  the  system  can  be 
approximated. Simulation is required to ensure that the system will perform as
desired.

4.6. Evaluation Categories: The Relationship Between the Layer and the Level

If  hardware  is  to  be  designed  for  a  specific  algorithm,  characteristics  of  the 
algorithm  must  be  “mapped”  onto  the  hardware.  To  build  hardware  for  a 
given  level,  a  user  must  supply  each  of  the  following  evaluation  categories 
about  each  layer  in  the  system. 


(1)  Type,  rate,  and  amount  of  inputs 

(2)  Type  and  number  of  operations  per  input  datum 

(3)  Range  and  accuracy  of  arithmetic  data  to  be  used 

(4)  Algorithm  to  be  used 

(5)  Type,  frequency,  and  message  length  of  processor-to- 
processor  communications 

(6)  Amount  of  memory  required 

(7)  Type,  amount,  and  benefit  of  parallelism 

(8)  Type,  rate,  and  amount  of  output 

(9)  Evaluation  criteria 

It  is  the  goal  of  this  section  to  use  these  parameters  to  form  a  model  of  the 
algorithms  in  the  task.  By  using  the  information  supplied  by  this  model  with 
the  hardware  model  of  Section  4.2  and  the  design  scenario  of  Section  4.4  it  is 
also  the  goal  of  this  section  to  develop  a  macro-pipelined  architecture  well 
suited  for  performing  the  task. 

The four factors that can influence the architecture of a specific level are
the data, algorithm, performance evaluation criteria, and the input/output
environment. Categories (1), (6), and (8) are data related; (2), (3), (4), (5),
(6), and (7) are algorithm related; (9) is the evaluation criteria; and (1) and
(8) are input/output environment related. Since at any layer or sub-layer, the
exact algorithm may be one of a set of candidates, the resulting architecture
for a given level and all following levels may also be a list of candidates,
with one system architecture per candidate algorithm. For the purposes of
discussion, the evaluation criteria will be speed and cost, i.e., the faster an
algorithm can be executed, the “better” the hardware design; however, the design
should not be excessively expensive. Several other evaluation criteria are
considered in [SiS82] and [Gon78].

Initially,  information  about  the  desired  throughput  and  response  time  of 
the  system  will  have  to  be  known.  The  first  step  in  the  design  process  will 
choose  all  levels  to  process  their  incoming  data  as  fast  as  possible  without  using 
vertical  or  horizontal  parallelism  within  any  given  level.  Since  this  type  of 
design  is  a  macro-pipeline,  the  slowest  level  will  limit  the  throughput  of  the 
entire  pipeline. 

To  fully  utilize  the  hardware  in  the  system,  it  is  desirable  to  match  the 
speed  of  all  the  levels.  There  are  two  design  philosophies  that  can  be  employed 
to  balance  the  throughput  of  the  levels.  After  the  initial  design  (all  levels 
designed  to  execute  their  layer  as  fast  as  possible  with  no  vertical  or  horizontal 
parallelism),  the  data  processing  rate  of  all  the  levels  will  be  known.  If  the 
designed  machine  meets  or  exceeds  the  throughput  and  response  time 
qualifications  of  the  scenario,  faster  levels  can  be  combined  or  built  with  slower 
less  expensive  hardware.  This  will  still  maintain  the  throughput  of  the  system, 
while  increasing  the  response  time.  Such  a  process  can  be  repeated  as  long  as 
the  throughput/response  time  requirements  are  met.  This  will  lower  the  cost 
of  the  overall  system. 

If the resulting macro-pipelined machine (i.e., one with no parallelism
within a given layer) fails to meet the throughput qualification for all
processors in the database, the execution speed of all levels not meeting the
time constraint must be increased. This can be done with either vertical or
horizontal parallelism.


If the machine fails to meet the response time constraints, horizontal
parallelism can be employed (vertical parallelism will not improve response
time). Let TLi be the mean response time for level i to perform its
corresponding layer, Trdes be the desired average response time for the system,
and NL be the number of levels. One way to meet the response time constraint
is to attempt to force:

$$T_{L_i} = \frac{T_{rdes}}{N_L}, \quad 1 \le i \le N_L \qquad (1)$$

Alternatively, the response time criterion may be met even if the equality
outlined in (1) is not true for all levels; however, the sum of the TLi's for all
levels must be less than or equal to Trdes. That is, the requirement is:

$$\sum_{i=1}^{N_L} T_{L_i} \le T_{rdes} \qquad (2)$$

In general, it is possible for equation (2) to be true without equation (1) being
true, and still satisfy the throughput requirements (although the execution
times for all levels of the macro-pipelined system will not be balanced and the
slowest level will determine the throughput). This implies that the required
throughput is less than one job every (Trdes/NL) time units; i.e., the required
throughput is less than (NL/Trdes) jobs per time unit. For this initial study,
however, equation (1) will be used as a guideline for the system design. Thus,
for all levels where

$$T_{L_i} > \frac{T_{rdes}}{N_L}$$

horizontal parallelism must be introduced to attempt to reduce TLi.
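The two response time tests can be sketched in a few lines. The sketch assumes the per-level guideline is an equal share Trdes/NL of the desired response time, as the throughput discussion above suggests; the function names and the level times are hypothetical.

```python
def levels_needing_parallelism(level_times, t_rdes):
    """Levels (numbered from 1) whose mean time exceeds the share Trdes/NL."""
    share = t_rdes / len(level_times)
    return [i for i, t in enumerate(level_times, start=1) if t > share]

def meets_response_time(level_times, t_rdes):
    """Requirement (2): the level times must sum to at most Trdes."""
    return sum(level_times) <= t_rdes

level_times = [2.0, 5.0, 3.0, 1.0]  # hypothetical TLi values, NL = 4
t_rdes = 12.0                       # per-level share is 3.0

print(levels_needing_parallelism(level_times, t_rdes))  # [2]
print(meets_response_time(level_times, t_rdes))         # True
```

In this example requirement (2) holds even though level 2 exceeds its share, so the response time is acceptable but the pipeline is unbalanced and level 2 limits the throughput.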

For simplicity, some form of coordination can be used between the levels.
Chapter 5 will explore the effects of lifting this restriction. For this chapter,
however, the coordination can be in the form of either (a) a master system
clock that tells each level when it can proceed to the next data set or (b) a unit
that keeps track of each level and, when all levels are done, signals each to
proceed to the next data set. The differences between these two
implementations are that (a) will use less hardware than (b), and that (b) will
execute at least as quickly as (a). If TLi is the time required for level i to
complete its subtask, then the clock period in (a) must be set to the maximum
possible value of TLi over all levels for all data sets. The implementation
suggested in (b) will require an execution time Te of:

$$T_e = \max(T_{L_1}, T_{L_2}, T_{L_3}, \ldots, T_{L_{N_L}})$$

While in the extreme case this will be equal to the period required by the
implementation suggested in (a), normally Te will be less than the TLi of the
slowest possible level. The following is an analysis of how each of the
categories is derived, and how it affects the architecture of a given level.
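The difference between the two coordination schemes can be illustrated with a short sketch before turning to the categories. It assumes a hypothetical table `times[k][i]` giving the time level i needs during interval k; the function names and the values are illustrative only.

```python
def master_clock_period(times):
    """Scheme (a): one fixed period, the worst case over all levels and intervals."""
    return max(max(row) for row in times)

def completion_signal_intervals(times):
    """Scheme (b): each interval lasts only as long as its slowest level (Te)."""
    return [max(row) for row in times]

# Hypothetical level times for three intervals of a three-level system.
times = [
    [3, 5, 4],
    [2, 6, 4],
    [3, 4, 9],
]

period = master_clock_period(times)
intervals = completion_signal_intervals(times)
print(period * len(times))  # total time under (a): 27
print(sum(intervals))       # total time under (b): 20
```

Scheme (b) is never slower than scheme (a), and the two coincide only when every interval contains a level that takes the worst-case time.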

Category (1) relates the input characteristics of the system to the I/O
environment in which it will execute. It places restrictions on the input
buffering, input data rate, and the internal data format of a level. The type of
data specifies the format and word width required to process the incoming
data. Combined with the rate, the type of data specifies the speed of the input
unit. Consider a situation in which 250 32-bit floating point numbers and 1000
8-bit integer numbers must be processed in one second. The types of input
data are specified. By combining the type with the rate, it follows that 2000
bytes of data per second (250 × 4 + 1000 × 1) must be either processed or stored
if the unit is to operate without losing data. The input rate is required for
the first level only, since all other levels transfer data according to a common
clock.
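The input rate calculation in this example can be written as a small helper. The function name and the representation of each input stream as an (items per second, bits per item) pair are assumptions made for illustration.

```python
def input_bytes_per_second(streams):
    """streams: iterable of (items_per_second, bits_per_item) pairs."""
    return sum(count * bits // 8 for count, bits in streams)

# 250 32-bit floating point numbers and 1000 8-bit integers per second.
print(input_bytes_per_second([(250, 32), (1000, 8)]))  # 2000
```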

The  proposed  architecture  will  overlap  data  transfer  between  levels  (and 
sub-levels)  with  the  computation  in  those  units.  Consider  the  example  shown  in 
Fig.  4.5.1  (a),  where  each  level  is  connected  to  the  next  level  through  a  buffer. 
Shown  next  to  each  level  is  its  triple  or  swinging  buffer  memory:  one  unit  for 
data  currently  being  operated  upon,  one  unit  for  storing  data  previously 
generated  by  that  level  (and  currently  being  sent  to  the  next  level),  and  one 
unit to receive data currently being sent by the previous level. This scheme was
proposed in [Dem83] and is quite useful for this application because it
allows the overlap of data transmission with the actual data processing, so each
level is effectively sending a data set, processing a data set, and receiving a
data set simultaneously. It is assumed that the data sets are actually
transmitted via some DMA device.

For example, consider the data sets in the figure using level i's swinging
buffers. Data set E is being sent by level i-1 (which previously generated it) to
level i (which will process it after it finishes processing data set D). Data set
D is currently being processed by level i. Data set C is being sent from level i
to level i+1 (which will process it after it finishes processing data set B). The
transmission of data sets A, C, E, and G, and the processing of data sets F, D,
and B, are all occurring simultaneously. The time to perform these
simultaneous transmissions and computations is called an interval. In the
next interval, data sets B, D, and F will be transmitted and data sets A, C, E,
and G will be processed. Similarly, in the interval prior to the one shown in the
figure, data sets A, C, and E were processed and data sets B, D, and F were
transmitted. In summary, an interval is the time required for a level to
simultaneously receive a data set, transmit a data set, and process a data set,
such as level i does with data sets E, C, and D, respectively, in Fig. 4.5.1 (a).

Since temporary results are stored in memory that is local to the
processors accessing the buffer, calculation of the size of the buffer between
level i and level i+1 is straightforward. If $iss_i$ is the input set size for
level i and $oss_i$ is the output set size for level i, the buffer memory
required for level i is:

$$\text{memory}_{buffer} = 3 \times \max(iss_i, oss_i)$$


Under different conditions, such as those discussed in Chapter 5, it is possible
that double buffering, as shown in Fig. 4.5.1 (b), can be employed instead of
triple buffering. In this scheme, each level processes the data in the “A”
portion of its input buffer while writing the results of the calculations in the
“B” portion of its output buffer. Then, the levels process the information in
the “B” portions of their input buffers and write the results in the “A” portions
of their output buffers. For this scheme, the buffer memory becomes:

$$\text{memory}_{buffer} = 2 \times \max(iss_i, oss_i)$$
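Both buffer sizing rules reduce to one expression with the number of buffer copies as a parameter. The function name, the `copies` parameter, and the set sizes in the example are hypothetical.

```python
def buffer_memory(iss, oss, copies=3):
    """Buffer memory for one level: copies x max(input set size, output set size).

    copies=3 corresponds to triple (swinging) buffering, copies=2 to double
    buffering.
    """
    return copies * max(iss, oss)

# Hypothetical level with a 512-word input set and a 2048-word output set.
print(buffer_memory(512, 2048))            # 6144
print(buffer_memory(512, 2048, copies=2))  # 4096
```

Each copy is sized for the larger of the two data sets, so the size is governed by whichever of the input and output sets is bigger.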


The type and number of operations (2) indicate what must be performed
to process incoming data. There are two classes of algorithms that are of
concern for this category: those that perform the same operations on
each data element (data independent) and those that treat each data element
differently (data dependent). For data independent algorithms, the number and
type of each operation performed is countable from the algorithm. Some
examples of algorithms that are data independent are smoothing [SiS83] and
maximum likelihood classification [SiS80].


A  reasonable  indication  of  the  data  dependence  of  an  algorithm  can  be 
defined by the following test equation:

\[ \text{Data Dependency} = \frac{\text{Data Dependent Operations}}{\text{Total Operations}} \]


Smoothing and maximum likelihood classification both have a Data Dependency
of 0.
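
As an illustration, the ratio can be computed directly from operation counts. The sketch below assumes the counts are already available (from inspection of the algorithm or from simulation); the function name and the example counts are hypothetical.

```python
def data_dependency(data_dependent_ops: int, total_ops: int) -> float:
    """Fraction of the operations whose execution depends on the data values."""
    if total_ops <= 0:
        raise ValueError("total_ops must be positive")
    if not 0 <= data_dependent_ops <= total_ops:
        raise ValueError("data_dependent_ops must lie between 0 and total_ops")
    return data_dependent_ops / total_ops


# Hypothetical counts: a smoothing kernel with no data dependent operations.
smoothing = data_dependency(0, 9)
# Hypothetical counts obtained by simulating a contour tracer on a sample image.
tracing = data_dependency(350, 1000)
print(smoothing, tracing)
```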

Examples  of  data  dependent  algorithms  include  contour  tracing  [TuA83], 
calculation  of  Fourier  descriptors  [SiS83],  and  calculation  of  center  of  mass 
[SiS83]. For a data dependent algorithm, the Data Dependency may not be
obtainable  from  the  algorithm  alone,  as  the  Data  Dependency  may,  itself,  be 
data  dependent.  In  such  cases,  the  Data  Dependency  must  be  determined 
through  simulation  on  a  sample  data  set.  Typically,  data  dependent  algorithms 
require  varying  resources  and  processing  times. 

The  Data  Dependency  is  a  valuable  measure,  as  it  gives  a  figure  of  merit 
to  the  number  and  type  of  operations  performed  to  process  data.  In  cases  where 
the  Data  Dependency  must  be  determined  by  simulation,  the  average  number 
and  type  of  operations  per  datum  must  also  be  determined.  The  Data 
Dependency  can  be  used  as  an  indication  of  the  appropriateness  of  SIMD  or 
MIMD  parallelism. 




The class of operations can be divided into five groups: (A) arithmetic, (B)
addressing, (C) index calculation, (D) conditional, and (E) data transfer. These
classes were chosen to yield information about which unit can process an
operation. For example, on some SIMD systems operations in class (C) can be
done  in  the  control  unit,  overlapped  with  the  parallel  execution.  The  rest  of 
the  operations  are  done  by  the  processing  elements.  Information  about  class 
(E)  indicates  how  much  the  network  will  be  used.  On  a  system  where  all 
processing  is  done  by  the  same  unit,  the  distinction  between  the  types  of 
operations  is  diminished;  however,  for  analysis  they  should  prove  useful. 

Information about the various categories will have to be further
subdivided to provide information necessary to choose suitable processing
hardware.  For  example,  category  (A)  should  be  divided  into:  floating  point 
additions,  subtractions,  multiplications,  divisions,  comparisons,  and  special 
functions;  and  fixed  point  additions,  subtractions,  multiplications,  divisions, 
comparisons,  and  special  functions.  The  usefulness  of  this  list  is  that  it 
indicates  the  relative  importance  of  the  speed  of  each  operation. 

For  each  floating  point  or  fixed  point  special  function,  the  number  of  times 
each  operation  is  expected  to  be  performed  is  specified  along  with  an 
equivalence relation, giving the number of “standard” operations needed to
implement  the  specified  function  in  software.  If  a  unit  cannot  perform  a 
specified  function  in  hardware,  the  time  required  to  synthesize  that  function 
(specified  by  the  equivalence  relation)  must  be  calculated.  By  using  an 
equivalence  such  as  this,  various  units  can  be  ranked  by  their  execution  speed 
for  a  given  algorithm. 


Consider an algorithm that requires only 1000 $e^x$ operations. Suppose a
particular unit has the “built in” capability to calculate $e^x$. Using [Har68],
calculating $e^x$ is worth nine multiplies, nine additions/subtractions,
one floor, one square root, and one division. By using this equivalence, if the
hardware unit requires 10 msec to calculate each $e^x$, then any unit not having
this special purpose hardware must perform the listed operations 1000 times in
ten seconds if it is to be “as fast” as the hardware unit.
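
The comparison can be sketched in Python. The equivalence table below is the one quoted from [Har68]; the per-operation times of the candidate unit are invented for illustration only, as are the names `EXP_EQUIVALENCE` and `synthesized_time`.

```python
# Operations needed to synthesize one e^x in software, per the equivalence above.
EXP_EQUIVALENCE = {"multiply": 9, "add_sub": 9, "floor": 1, "sqrt": 1, "divide": 1}


def synthesized_time(op_times_msec: dict, equivalence: dict = EXP_EQUIVALENCE) -> float:
    """Time (msec) for a unit to synthesize one special function in software."""
    return sum(count * op_times_msec[op] for op, count in equivalence.items())


# Hypothetical candidate unit without e^x hardware (times in msec per operation).
candidate = {"multiply": 0.4, "add_sub": 0.1, "floor": 0.1, "sqrt": 2.0, "divide": 0.8}

per_call_msec = synthesized_time(candidate)
total_sec = 1000 * per_call_msec / 1000.0   # 1000 evaluations, msec -> sec
hardware_sec = 1000 * 10 / 1000.0           # 1000 evaluations at 10 msec each
print(per_call_msec, total_sec, total_sec <= hardware_sec)
```

Repeating the calculation for each unit in a database gives the ranking by execution speed described above.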

The  usefulness  of  this  list  is  that  it  indicates  the  relative  importance  of  the 
speed  of  each  operation.  For  example,  if  there  is  only  one  floating  point  divide 
to  be  performed  on  the  entirety  of  a  given  layer,  a  hardware  floating  point 
divide  is  likely  to  have  little  consequence.  It  may  be  necessary,  as  shown  in 
Section  8,  to  subdivide  the  fixed  point  operations  into  two  categories 
discriminating  between  indexing  operations  and  integer  calculations.  This  is 
required  in  the  event  of  SIMD  parallelism,  where  the  control  unit  has  a 
different  data  pathwidth  than  the  processing  unit. 

Evaluation  category  (2)  helps  place  a  value  on  TLj  in  terms  of  the  actual 
operations  that  must  be  performed.  From  one  data  set  to  another  the  required 
processing  may  vary,  so  an  exact  statement  of  what  operations  must  be 
performed  may  be  unavailable;  however,  a  reasonable  estimate  may  be 
calculated  for  the  average  case  through  simulation  techniques,  as  was  done  with 
Sobel  edge  detection  in  [SiS83].  In  this  edge  detection  algorithm,  if  a  pixel  is 
not  an  edge  pixel  for  an  object,  it  is  essentially  unused;  however,  if  a  pixel  is  an 
edge  pixel,  it  is  used  in  the  calculation  of  a  chain  code  that  describes  the  edges 
of  a  closed  object.  This  chain  code  is  then  used  for  further  processing  (in  this 
case,  the  chain  code  would  be  passed  on  to  another  level). 

Estimates  on  the  calculations  to  be  performed  can  be  used  to  determine 
processing  speed  and  special  hardware  requirements  for  a  given  level.  Simply, 
if  an  algorithm  requires  large  numbers  of  a  given  type  of  operation,  the 


corresponding  level  should  have  hardware  to  perform  that  operation  quickly. 

The  numerical  range  and  accuracy  of  a  sub-task  (3)  is  a  function  of 
algorithm  and  data.  For  an  algorithm,  it  is  necessary  to  determine  the 
maximum  and  minimum  values  of  the  range  of  the  calculations.  The  range  of 
the  calculations  should  be  divided  according  to  the  range  of  index  values,  range 
of  integer  arithmetic,  and  the  range  of  floating  point  arithmetic.  This  specifies, 
in  the  SIMD  case,  the  word  size  of  the  control  unit,  and  the  word  size  of  the 
integer and floating point units. In other cases (SISD/MIMD) [Fly66], the word
size of the integer unit is set according to the maximum range required for
integer and indexing arithmetic. It is assumed that the floating point and
integer hardware can have different widths. An example of this is the
PDP-11/70, where a single precision integer is 16 bits and a single precision
floating point number is 32 bits.

Category  (3)  places  various  limitations  on  the  hardware.  Typically,  more 
accurate  hardware  (larger  words)  will  be  slower  and  more  costly.  Floating 
point  operations  are  typically  slower  than  the  corresponding  integer  operations. 
In  certain  cases,  if  the  numerical  range  required  for  various  calculations  is 
small,  but  out  of  the  range  of  specific  hardware,  e.g.,  underflow,  normalization 
of  data  can  eliminate  the  need  for  special  hardware  at  the  cost  of  some 
processing  time.  The  arithmetic  range  associated  with  a  set  of  operations 
greatly  affects  the  hardware  required  [SmS81].  If  only  8-bit  precision  is  needed, 
a  32-bit  processor,  which  is  typically  more  expensive,  memory  intensive,  and 
slower  than  a  corresponding  8-bit  processor,  will  offer  no  benefits  in  exchange 
for the extra word length. Certain processors, such as the Am9511A [Amd82],
have varying precision and can be employed in cases where arithmetic ranges
vary from loop to loop. Other approaches to dynamic word size machines are
presented  in  [KaK78]  and  [LiT77]. 

Determination  of  the  type  and  amount  of  processor-to-processor 
communication  (4)  for  a  highly  data  independent  task  is  straightforward.  In 
the  case  of  non-uniform  tasks,  the  required  transfers  may  vary  randomly  in  size 
and  connection,  dependent  solely  on  the  data  set  being  processed.  In  this 
situation,  simulation  may  be  required  to  achieve  accurate  estimates  for  the 
average  case.  To  minimize  the  need  for  simulation,  analysis  of  the  data  set  can 
yield  information  about  the  required  connections.  For  example,  if  a  process  is 
edge  tracing  small  objects  relative  to  the  size  of  the  image  being  processed, 
global  connections  are  not  required,  only  local  (nearest  neighbor)  connections 
are  needed.  If  the  objects  are  large,  then  global  connections  may  be  needed. 

In  addition,  with  knowledge  about  the  algorithm  a  level  is  to  process  (4), 
special  parallel  analysis  techniques,  such  as  those  discussed  in  [Ber66],  [RaG69], 
and  [KuM72]  can  be  employed  to  utilize  “extra”  parallelism.  This  can  be 
accomplished  by  breaking  the  algorithm  down  into  multiple  streams,  using 
MIMD  parallelism.  Applicable  loops  are  those  containing  variables  that  can  be 
calculated independently of other variables within the loop. The “breakdown”
occurs when a variable can be extracted from a loop and calculated in a
separate environment (either a different processor or processors). Other
techniques for parallel processing, such as the use of “recursive doubling” for
calculating sums or maximums [Sto80], can be applied.
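
A sequential Python simulation of a tree-style reduction, in the spirit of recursive doubling, is sketched below. It is not taken from [Sto80]; the function name is an assumption, and the inner loop stands in for work that would be done by separate processors in a single parallel step.

```python
import operator


def recursive_doubling(values, op):
    """Combine a non-empty sequence pairwise.

    Returns (result, steps), where steps is the number of parallel steps
    (about log2 N) that N processors would need.
    """
    data = list(values)
    if not data:
        raise ValueError("values must be non-empty")
    n = len(data)
    stride, steps = 1, 0
    while stride < n:
        # Every update in this loop is independent of the others, so each
        # could run on its own processor during the same step.
        for i in range(0, n - stride, 2 * stride):
            data[i] = op(data[i], data[i + stride])
        stride *= 2
        steps += 1
    return data[0], steps


total, steps = recursive_doubling([3, 1, 4, 1, 5, 9, 2, 6], operator.add)
largest, _ = recursive_doubling([3, 1, 4, 1, 5, 9, 2, 6], max)
print(total, steps, largest)
```

Any associative operation (sum, maximum, minimum) can be passed as `op`.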

The  algorithm  is  required  to  obtain  timing  information  from  the  previously 
discussed tuples. By deriving boundaries on execution time as described in
[HuL82], levels requiring large amounts of time can be analyzed. The algorithm
must be scanned to determine what operations can be pipelined and/or
overlapped. This must be done for each processor in the database. After the
amount of time saved by parallelism and pipelines is determined, this time is
then subtracted from the execution time for the processor. For systems with
reconfigurable pipelines, the reconfiguration times must be multiplied by the
number of reconfigurations required by the algorithm. This will give an indication
as to where each level is spending its execution time.

If  consistent  variable  names  are  used  from  layer  to  layer,  similar  task 
decomposition  to  the  above  can  be  applied  across  levels  to  allow  the 
combination  and/or  sub-division  of  levels  as  needed.  Consider  the  scenario  in 
Fig.  4.5.2.  The  three  boxes  represent  levels  one,  two,  and  three.  If  level  three 
calculates  a,  b,  and  c  independently  of  the  output  of  level  two,  and  the 
throughput  of  level  three  is  too  low,  the  portion  of  the  algorithm  calculating 
a, b, and c can be moved to level two. If this makes the throughput of level two
too  low,  a  separate  unit  can  be  employed  for  the  calculations.  The  result  is 
shown  on  the  right  of  the  figure. 

The  type,  frequency,  and  message  length  of  the  processor-to-processor 
communications  within  a  layer  (5)  will  dictate  the  topology  of  a  level  and  the 
design  of  the  interconnection  network.  There  are  two  types  of  interconnection 
networks. A global interconnection network allows a processor to communicate
directly  with  any  other  processor  within  a  given  horizontally  parallel  structure 
(e.g.,  SIMD  or  MIMD  portion  of  the  machine).  Typically,  a  multistage 
arrangement  is  used  for  such  a  network  [Sie85].  The  second  type  of 
interconnection  network  is  a  local  interconnection  network,  which  allows  a 
processor to communicate with a specific number of its neighbors (e.g., 4- or
8-nearest neighbors) [SmS81]. In this case, the processors can be viewed as either
a  one,  two,  or  three  dimensional  array  when  determining  the  connections  to  be 


made  by  the  network.  A  network  must  be  capable  of  making  the  desired 
connections  efficiently  and  with  a  minimal  number  of  collisions,  to  avoid 
significant  delays  during  transfers.  It  would  be  desirable  to  have  a  database  of 
known  global  connection  networks  and  the  permutations  that  they  can 
perform,  so  an  appropriate  connection  network  can  be  chosen. 

From  the  type  of  communications  required  by  a  layer,  information  can  be 
gained about the type of processing that should take place on a given level, i.e.,
the  more  random  the  communications,  the  more  likely  that  a  horizontally 
parallel  level  should  use  MIMD  (or  asynchronous)  parallelism,  as  opposed  to 
SIMD  (synchronous)  parallelism.  Knowing  the  size  of  the  transfers  will  aid  the 
design  of  the  network.  For  instance,  the  longer  the  transfers,  the  more  suitable 
a  circuit  switched  network  becomes.  For  small  transfers,  a  packet  switched 
network  is  desirable.  Knowing  the  number  of  network  transfers  in  conjunction 
with  the  size  of  the  average  transfer  will  provide  information  about  the  loading 
of  a  network  with  a  given  transfer  speed.  Consider  an  environment  where  most 
of  a  processor's  data  is  stored  in  memories  associated  with  other  processors.  In 
an  MIMD  environment,  a  slow  network  will  have  collisions  within  the  network 
in  addition  to  the  collisions  at  the  memories.  (An  ideal  network  with  zero 
transfer  time  will  only  have  collisions  at  the  memories.) 

The amount of memory (6) is an important factor in the design of a
system  and  is  a  function  of  the  proposed  data  set  size,  data  type,  and 
algorithm.  Memory  usage  falls  into  three  classes:  program  memory,  stack 
memory,  and  data  memory.  Program  memory  (size  of  the  binary)  is  not 
determinable  from  the  algorithm.  It  is  a  function  of  the  machine  and  the 
compiler.  The  stack  memory  contains  arguments  to  subroutines,  return 
addresses,  and  temporary  information.  Its  size  is  a  function  of  the  nesting  of 
subroutines, along with the amount of information that is passed to those
subroutines.  For  data  dependent  algorithms  that  use  some  form  of  recursion, 
simulation  may  be  required  to  determine  the  appropriate  amount  of  stack 
memory  needed.  An  alternative  to  simulation  is  to  place  a  maximum  depth  (in 
terms  of  calls  to  specific  functions)  on  the  stack.  If  each  specific  function  is 
called  with  a  given  number  of  arguments  (each  with  a  given  size),  calculation  of 
the  stack  size  is  straightforward. 

The data memory size is the sum of the index memory size and the
process data size, where the index memory is the memory required to store
index variables and loop counters and the process data memory is the memory
used to store the actual data set and intermediate results. In general, data set size can
be  calculated  from  an  algorithm. 

The  particular  divisions  of  memory  stem  from  where  the  data  must  be 
accessed.  In  an  SIMD  environment,  the  stack,  index  memory,  and  program 
memory  must  be  associated  with  the  control  unit,  while  the  process  data  must 
be  accessible  by  the  processing  elements.  In  other  environments,  this  memory  is 
associated  with  the  processor,  so  the  divisions  do  not  matter  so  much  as  their 
total. 

The  memory  size  is  an  important  factor  in  the  design  of  a  system.  The 
data  set  size,  number  of  processors  in  a  level,  and  algorithm  chosen  have  a  very 
important  bearing  on  how  much  memory  is  associated  with  a  processor  in  a 
given  level.  The  previously  discussed  level-level  buffering  is  not  considered 
with  this  value. 

The  type  of  parallelism  (7)  can  be  determined  by  employing  a  special 
algorithm  for  a  specific  type  of  machine,  or  by  determining  whether  an 
algorithm  is  best  suited  for  a  specific  environment.  One  factor  that  influences 


this  decision  is  the  Data  Dependency,  as  discussed  above.  For  a  general 
parallel  algorithm,  the  lower  the  Data  Dependency,  the  more  likely  an 
algorithm  is  suited  to  SIMD  type  processing.  In  SIMD  mode,  some  processors 
may  be  disabled  while  other  processors  execute  portions  of  an  algorithm. 
MIMD mode does not have this drawback. However, MIMD hardware does not
typically overlap control unit instructions with processing element instructions.
On an MIMD system, it may be necessary to synchronize the processors to
ensure that certain processing has been done before execution continues.

The  amount  of  parallelism  can  be  determined  by  several  criteria. 
Typically,  the  larger  the  number  of  processors,  the  less  processing  each  must 
perform  and  the  more  significant  transfer  and  wait  times  become.  As  transfer 
and  wait  times  become  more  significant,  the  processors  will  spend  a  larger 
portion  of  time  idled,  so  the  utilization  of  a  processor  will  decrease.  The 
question  to  be  answered  here  is:  “At  what  point  is  the  utilization  of  a  processor 
more  important  than  raw  speed?”  If  different  instructions  require 
approximately  the  same  amounts  of  time,  a  reasonable  estimate  for  this  figure 
can  be  obtained  as  a  ratio  of  instructions  to  instructions  plus  waits.  Thus,  the 
utilization  can  be  obtained  as  a  ratio  of  time  spent  processing  data  to  the  time 
spent on the entire task (processing and waiting). The Utilization can be
defined as:

\[ \text{Utilization} = \frac{\text{total processing time}}{\text{total job time}} \]


A variety of performance measures are discussed in [SiS82]. These can be
used to determine the relative benefit of each additional processor, allowing one
to decide on the number of processors associated with a given level.

Analyzing the algorithm for inherent parallelism with techniques such as
those in [Ber66], [RaG69], and [KuM72] provides insight into how additional
SIMD  or  MIMD  (horizontal)  parallelism  can  be  utilized  to  increase  execution 
speed.  Consider  the  case  of  two  non-concentric  loops  which  do  not  require 
information  from  each  other.  If  they  are  processed  on  two  independent 
machines,  as  opposed  to  one  machine,  the  results  will  be  the  same,  but  the 
execution  time  will  decrease. 

The  type  and  amount  of  parallelism  will  specify  the  nature  and  maximum 
number  of  processors  associated  with  a  given  level.  The  benefit  due  to 
parallelism  is  specified  in  two  areas:  the  speedup  due  to  Np  processors  and  the 
maximum  value  of  Np.  If  speed  is  the  only  criterion,  then  the  number  of 
processors  associated  with  a  given  level  could  become  quite  large.  Consider 
smoothing,  in  which  the  value  for  a  pixel  (picture  element)  is  replaced  by  the 
average  of  itself  and  the  values  of  its  eight  nearest  neighbors.  In  a  case  where 
transfer  time  is  negligible,  the  fastest  parallel  algorithm  can  smooth  a  pixel  in 
eight  additions  and  one  division.  This  is  the  case  where  each  processor  is 
associated  with  one  pixel.  For  small  images  this  is  feasible;  however,  for  larger 
images  the  cost  due  to  the  large  number  of  processors  becomes  prohibitive.  The 
cost  limitation  on  a  given  stage  limits  the  amount  of  parallelism.  This  has  the 
side effect of limiting the significance of the network transmission rate
(typically, as the number of processors increases, the effect of the network
transfer/collision rate becomes more significant).

Knowledge of the type, rate, and amount of output (8) will be required for
any formatting that must be done to interface the data to the device gathering
the results. In addition, it places constraints on the output data rate.

Finally, the evaluation criteria (9) define how the merit of a system is to
be  calculated.  By  incorporating  this  into  the  design  procedure,  proposed  designs 
not  meeting  the  evaluation  criteria  can  be  avoided.  In  addition,  this  provides  a 
way  to  rank  various  designs. 

4.6.  An  Isolated  Word  Recognition  System  —  Task  Description 

Consider  the  application  of  the  above  theory  to  an  isolated  word 
recognition  system.  From  [Yod82],  isolated  words  are  those  that  are  surrounded 
by  distinct  pauses.  Fig.  4.6.1  [Yod82]  is  a  diagram  showing  the  proposed 
scenario.  The  computational  portion  of  this  task  will  be  everything  past  the 
digitization. To be useful, the resulting design should process the data in
real-time. To work with telephone quality speech, the system will have to process
6,670 16-bit words per second. This is the minimum speed limitation on the
hardware. In addition, there is a minimum response time. For example, one
would not wish the delay between an utterance and its recognition to be
more than a few seconds. Such a system has been discussed in [RaL79].

Conceptually, the task may be layered as shown in Fig. 4.6.2. Layer 1 must
pre-emphasize the input signal with the following Z-transform:

\[ H(z) = 1 - 0.95\, z^{-1} \]

According  to  [RaL79],  this  serves  to  reduce  the  variance  in  later  calculations 
(linear predictive coding). From [Oga70], H(z) translates into:

\[ \tilde{S}(M) = S(M) - 0.95\, S(M-1) \]

where S(M) is the Mth sample of the incoming signal S and $\tilde{S}(M)$ is the
corresponding pre-emphasized sample.
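
The pre-emphasis step is a first-order difference and can be sketched in Python as follows. Floating point is used here for brevity, whereas the design in the text uses fixed point; the function name and the assumption that the sample before the first one is zero are illustrative choices, not part of the original design.

```python
def pre_emphasize(samples, coeff=0.95):
    """Apply H(z) = 1 - coeff * z^-1 to a sequence of samples."""
    out = []
    previous = 0.0  # assumption: the signal is silent before the first sample
    for s in samples:
        out.append(s - coeff * previous)
        previous = s
    return out


print(pre_emphasize([1.0, 1.0, 2.0]))
```

Each output sample costs one multiplication and one subtraction, which is consistent with the per-sample operation counts listed for this layer.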

There  has  been  discussion  about  what  type  of  arithmetic  is  required  for 
this  process  [MaG74].  In  [MaG74],  it  has  been  shown  that  the  desirable  number 
of total bits in the word is ⌈sampling rate in kHz + 8⌉. Thus, for a 6.67 kHz
sampling rate, a word width of 15 bits is acceptable. Since the incoming data is
16  bits  wide,  this  represents  more  accuracy  than  is  needed.  To  maintain 
accuracy  through  the  various  levels,  a  word  width  of  at  least  24  bits  will  be 
considered.  This  represents  an  additional  8  bits  to  minimize  error.  It  should  be 
noted  that  this  is  9  bits  larger  than  the  minimum  word  size  suggested  in 
[MaG74]  and  is  included  only  to  minimize  any  rounding  error.  Floating  point 
calculations  can  be  avoided  by  using  the  following  (non-integer)  fixed  point 
format: 


| mantissa (bits 23-8) | fraction (bits 7-0) |
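
A small Python sketch of arithmetic in this 24-bit format is given below. It assumes two's complement storage and truncation after multiplication; neither detail is stated in the text, and all names are hypothetical.

```python
FRACTION_BITS = 8             # bits 7-0 hold the fraction
WORD_BITS = 24                # bits 23-0 make up the whole word
SCALE = 1 << FRACTION_BITS    # one unit in the last place is 1/256


def to_fixed(x):
    """Quantize x to the nearest multiple of 1/256."""
    return round(x * SCALE)


def to_float(q):
    """Convert a fixed point value back to a Python float."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two fixed point values.

    The raw product carries 16 fraction bits, so 8 are shifted out
    (truncation toward negative infinity).
    """
    return (a * b) >> FRACTION_BITS


def fits(q):
    """True if q fits in a signed 24-bit word (assumption: two's complement)."""
    return -(1 << (WORD_BITS - 1)) <= q < (1 << (WORD_BITS - 1))


sample = to_fixed(100)               # a 16-bit input sample promoted to fixed point
coeff = to_fixed(0.95)               # stored as 243/256 = 0.94921875
product = fixed_mul(sample, coeff)
print(sample, coeff, to_float(product), fits(product))
```

With only 8 fraction bits the constant 0.95 is stored as 0.94921875, which illustrates the rounding error that the extra word width is meant to contain.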


Thus, each pre-emphasized sample $\tilde{S}(M)$ may be obtained with one integer fetch, one fixed point multiplication,
one  fixed  point  subtraction,  one  fixed  point  addition,  one  fixed  point  store,  and 
two  fixed  point  register-to-register  transfers.  In  order  to  keep  up  with  incoming 
data,  the  level  performing  this  calculation  must  perform 


6670  16-bit  integer  fetches 
6670  16-bit  integer-to-floating  point  conversions 
6670  fixed  point  multiplications 
6670  fixed  point  subtractions 
6670  fixed  point  stores 
13340  fixed  point  register-to-register  moves 

every  second.  The  fixed  point  operations  are  distinguished  from  the  integer 
operations  only  by  the  24-bit  word  width. 

This  particular  scenario  performs  its  analysis  by  using  the  autocorrelation 
method  of  finding  the  linear  predictive  coding  (LPC)  analysis.  The  underlying 
assumption of the autocorrelation method is that S(0) and S(M-1) are both 0
for a window containing M samples. To make this condition true, a Hamming
window is applied to the pre-emphasized signal $\tilde{S}$. The resultant equation
for the windowed signal s is:

\[ s(m) = \tilde{S}(m) \times W(m), \qquad 0 \le m \le M-1 \]

where W(m) [Yod82] is defined by:

\[ W(m) = 0.54 - 0.46 \cos\left( \frac{2 \pi m}{M-1} \right) \]

and  M  is  the  number  of  samples  per  frame.  The  frame  length  is  fixed  and 
contains  from  100  to  400  samples.  Note  that  now  calculations  are  in  terms  of 
frames  as  a  basic  unit,  as  opposed  to  one  data  element.  A  typical  method  uses 
300  sample  frames  that  begin  every  100  samples.  For  the  purposes  of 
calculation,  W(m)  only  needs  to  be  calculated  once  and  loaded  with  the 


program  as  a  set  of  constants.  Thus,  for  each  300  sample  frame,  this  portion  of 
the  task  will  require  300  fixed  point  fetches,  multiplications,  and  stores.  In 
addition,  windowing  will  require  one  integer  load  and  299  integer  additions. 
These  operations  must  be  performed  every  14.9  msec  because  the  frames  begin 
every  100  samples. 
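
The framing and windowing described above can be sketched in Python. The window is computed once, as suggested in the text; the helper names and the use of floating point are assumptions made for illustration.

```python
import math


def hamming_window(M):
    """W(m) = 0.54 - 0.46 cos(2 pi m / (M - 1)) for m = 0 .. M-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (M - 1)) for m in range(M)]


def frames(signal, length=300, step=100):
    """Yield overlapping frames: `length` samples, starting every `step` samples."""
    for start in range(0, len(signal) - length + 1, step):
        yield signal[start:start + length]


def apply_window(frame, window):
    """One multiplication per sample, as counted in the text."""
    return [s * w for s, w in zip(frame, window)]


window = hamming_window(300)            # computed once, then treated as constants
signal = [1.0] * 500                    # hypothetical pre-emphasized input
windowed = [apply_window(f, window) for f in frames(signal)]
print(len(windowed), window[0], window[299])
```

Note that the end values of this window are 0.08 rather than exactly zero, so the frame edges are strongly attenuated but not nulled.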

After  the  windowing  has  been  performed,  the  autocorrelation  coefficients 
are calculated using the following equation:

\[ R(i) = \sum_{m=0}^{M-i-1} s(m)\, s(m+i), \qquad 0 \le i \le p \]

Normally  p  is  between  6  and  25.  For  the  purposes  of  this  paper,  p  is  8.  Since 
there  are  300  multiplications  and  299  additions  that  must  be  performed  for 
every  frame,  the  incoming  data  must  be  converted  into  32-bit  fixed  point 
representation.  This  can  be  accomplished  by  either  zero  filling  or  sign  extension 
of  the  product  terms.  Thus,  this  layer  will  require  2764  fixed  point  additions 
and  2773  fixed  point  multiplications  every  14.9  milliseconds.  From  this  point, 
the  data  is  passed  onto  the  next  layer,  where  LPC  analysis  is  performed. 
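
A direct Python transcription of the autocorrelation sum is shown below as a sketch. It uses ordinary Python numbers rather than the 32-bit fixed point representation discussed above, and the function name is an assumption.

```python
def autocorrelation(s, p=8):
    """Return [R(0), ..., R(p)] with R(i) = sum over m = 0 .. M-i-1 of s(m) * s(m+i)."""
    M = len(s)
    return [sum(s[m] * s[m + i] for m in range(M - i)) for i in range(p + 1)]


# Tiny hypothetical frame, small enough to check by hand.
print(autocorrelation([1, 2, 3], p=2))
```

For the three-sample frame, R(0) = 1 + 4 + 9, R(1) = 1·2 + 2·3, and R(2) = 1·3.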

The  goal  of  LPC  analysis  is  to  reduce  the  number  of  parameters  that  are 
required  to  represent  the  speech  frame.  LPC  analysis  assumes  that  each 
sample  is  a  linear  combination  of  the  p  previous  samples  and  an  excitation.  By 
assuming  that  the  speech  is  0  outside  the  present  frame,  i.e., 
s(m) = 0 for m < 0 and m ≥ M, the LPC analysis can be broken down to the
following equation [RaS78]:

\[ \sum_{k=1}^{p} a(k)\, R(|i-k|) = R(i), \qquad 1 \le i \le p \]


Since p = 8, the above equation translates to:

\[
\begin{aligned}
R(1) &= a(1)R(0) + a(2)R(1) + a(3)R(2) + \cdots + a(7)R(6) + a(8)R(7) \\
R(2) &= a(1)R(1) + a(2)R(0) + a(3)R(1) + \cdots + a(7)R(5) + a(8)R(6) \\
R(3) &= a(1)R(2) + a(2)R(1) + a(3)R(0) + \cdots + a(7)R(4) + a(8)R(5) \\
R(4) &= a(1)R(3) + a(2)R(2) + a(3)R(1) + \cdots + a(7)R(3) + a(8)R(4) \\
R(5) &= a(1)R(4) + a(2)R(3) + a(3)R(2) + \cdots + a(7)R(2) + a(8)R(3) \\
R(6) &= a(1)R(5) + a(2)R(4) + a(3)R(3) + \cdots + a(7)R(1) + a(8)R(2) \\
R(7) &= a(1)R(6) + a(2)R(5) + a(3)R(4) + \cdots + a(7)R(0) + a(8)R(1) \\
R(8) &= a(1)R(7) + a(2)R(6) + a(3)R(5) + \cdots + a(7)R(1) + a(8)R(0)
\end{aligned}
\]


This is equivalent to $\vec{R} = \mathbf{R}\,\vec{a}$, where $\mathbf{R}$ is an 8-by-8 Toeplitz matrix (symmetric
with only one unique value along each diagonal) and $\vec{a}$ and $\vec{R}$ are 8-by-1 vectors.
Having $\mathbf{R}$ and $\vec{R}$, the goal is to solve the above equation for $\vec{a}$, which can be
done by calculating $\mathbf{R}^{-1}$. There are algorithms that can utilize the special
properties of $\mathbf{R}$ to calculate $\mathbf{R}^{-1}$ in fewer than the $O(p^3)$ ($\le k_1 p^3$) operations
normally required. One such method is Durbin’s Algorithm, shown in Fig.
4.6.3, which calculates the a’s, as opposed to $\mathbf{R}^{-1}$. The computational
requirements  of  Durbin’s  Algorithm  for  the  calculation  of  the  LPC  coefficients 
for  an  8-pole  autocorrelation  method  analysis  are: 


18 integer initializations,
376 integer/index additions/subtractions,
72 integer comparisons,
8 floating point initializations,
89 fixed point assignments,
28 fixed point additions,
44 fixed point subtractions,
72 fixed point multiplications,
8 fixed point divisions,
225 fixed point fetches,
64 fixed point fetches *,
64 fixed point stores *,
128 integer/index operations


every  14.9  msec.  (Items  marked  with  an  asterisk  are  required  for  formatting 
R.)  After  this  point,  only  the  a’s  and  R(0)  are  passed  on  to  the  next  layer. 

E^(0) = R(0);

FOR i ← 1 TO p DO
    /* compute k(i) */
    k(i) ← 0;
    FOR j ← 1 TO i-1 DO
        k(i) ← k(i) + a_j^(i-1) · R(i-j);
    k(i) ← [R(i) - k(i)] / E^(i-1);
    E^(i) ← (1 - k(i)^2) · E^(i-1);

    /* compute a_j's for stage i */
    a_i^(i) ← k(i);
    FOR j ← 1 TO i-1 DO
        a_j^(i) ← a_j^(i-1) - k(i) · a_(i-j)^(i-1);

FOR j ← 1 TO p DO a_j ← a_j^(p);


Fig. 4.6.3 Durbin's algorithm to compute LPC coefficients
a_i from autocorrelation coefficients [Yod82]
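
For readers who wish to experiment, the recursion of Fig. 4.6.3 can be sketched in Python as follows. The sketch uses floating point rather than the fixed point arithmetic assumed in the operation counts, it assumes R(0) is nonzero, and the function and variable names are illustrative.

```python
def durbin(R, p):
    """Solve sum_k a(k) R(|i-k|) = R(i), 1 <= i <= p, by Durbin's recursion.

    R holds R(0) .. R(p). Returns ([a(1), ..., a(p)], final prediction error E).
    """
    E = R[0]
    a = [0.0] * (p + 1)              # a[1..p]; index 0 is unused
    for i in range(1, p + 1):
        # compute k(i)
        acc = sum(a[j] * R[i - j] for j in range(1, i))
        k = (R[i] - acc) / E
        # compute the a's for stage i from those of stage i-1
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        E = (1.0 - k * k) * E
    return a[1:], E


# Hypothetical autocorrelation values of a first-order process.
coeffs, err = durbin([1.0, 0.5, 0.25], p=2)
print(coeffs, err)
```

The inner loops run i-1 times at stage i, so the work grows as p squared rather than p cubed.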

At this point, the beginning and ending points of the word must be found.
This will be used to “warp” incoming words, making them the same length
(linear time warping). In [RaS78], there is a simple method for determination of
the end points in an utterance. This method makes use of the energy (= R(0))
of  each  frame  and  the  number  of  times  the  normalized  signal  changes  sign  in 
one  frame  (zero  crossing).  It  has  been  shown,  however,  that  the  number  of  zero 
crossings  in  telephone  quality  speech  is  not  effective  in  detecting  word 
boundaries  [LaR81]. 

Since  each  word  is  surrounded  by  silence,  there  must  be  some  perturbation 
to  indicate  that  a  word  is  present.  There  are  only  two  types  of  speech,  voiced 
and  unvoiced.  From  [SaR75],  a  voiced  sound  will  result  in  a  large  energy, 
while  other  sounds  will  result  in  only  moderate  energy.  Thus,  setting  a  lower 
and  an  upper  bound  on  the  energy  will  allow  one  to  determine  the  starting  and 
ending  points  of  a  word.  The  two  energies  allow  the  determination  of  the 
existence  of  a  word  without  losing  valuable  information  contained  between  the 
lower  energy  and  upper  energy  thresholds. 

Consider  Fig.  4.6.4.  From  frame  two  to  frame  six,  the  energy  exceeds  the 
lower  energy  bound,  indicating  that  the  sound  is  voiced.  After  the  sixth  frame, 
the  energy  is  below  the  minimum  threshold  values,  indicating  that  no  speech  is 
present.  For  the  purposes  of  analysis,  two  constants  are  defined.  They  are  the 
upper  and  lower  energy  (UE  and  LE  respectively)  and  are  defined  as  follows: 

LE = MIN(0.03 x (PEAK - SILENT) + SILENT, 4 x SILENT)


UE  =  5  x  LE 


PEAK  and  SILENT  are  the  largest  energy  over  all  frames  and  largest  energy 
over the ten silent frames, respectively. LE and UE can be predetermined to
reduce  the  calculational  load.  The  endpoint  detection  algorithm  is  as  follows 
(the  value  of  the  frame  pointer  after  the  application  of  this  algorithm  to  Fig. 
4.6.4  is  shown  in  braces): 

(1)  Measure  energy  for  every  frame  (=R(0)). 

(2)  Set  a  pointer  to  the  first  frame  that  exceeds  the  upper 
energy  threshold  {4}. 

(3)  Back  the  pointer  up  until  it  points  to  a  frame  that  does  not 
exceed  the  minimum  energy  threshold  {1}. 

(4)  Advance  the  pointer  one  frame  {2}. 

The  endpoint  is  located  by  applying  the  same  procedure  in  reverse.  This 
process  requires  300  fixed  point  comparisons  and  a  variable  number  of  integer 
operations  depending  on  word  length.  In  addition,  there  is  a  variable  number 
of  floating  point  operations  required  for  each  frame  as  the  frame  pointer  is 
backed  up. 
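
The four steps above, together with the LE and UE definitions, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0-based frame indexing, and the sample energies are assumptions made for the example and do not come from the original implementation.

```python
def energy_thresholds(peak, silent):
    """Return (LE, UE) computed from the PEAK and SILENT energies."""
    le = min(0.03 * (peak - silent) + silent, 4 * silent)
    return le, 5 * le


def find_start(energy, le, ue):
    """Return the 0-based index of the first frame of the word, or None."""
    # Step 2: first frame whose energy exceeds the upper threshold.
    p = next((i for i, e in enumerate(energy) if e > ue), None)
    if p is None:
        return None
    # Step 3: back up until a frame does not exceed the lower threshold.
    while p >= 0 and energy[p] > le:
        p -= 1
    # Step 4: advance the pointer one frame.
    return p + 1


def find_endpoints(energy, le, ue):
    """Return (start, end) frame indices, applying the procedure in reverse for the end."""
    start = find_start(energy, le, ue)
    if start is None:
        return None
    end = len(energy) - 1 - find_start(energy[::-1], le, ue)
    return start, end
```

With hypothetical energies `[0.5, 2, 3, 6, 7, 2, 0.5]` for frames one to seven and thresholds LE = 1 and UE = 5, the pointer visits frames 4, 1, and 2 in turn, and the word is found to span frames two through six.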

After  the  beginning  and  ending  points  of  the  word  are  known,  a  procedure 
called  linear  time  warping  is  applied  to  make  all  words  the  same  length.  In  this 
procedure,  a  speech  segment  containing  M  frames  of  data  is  compressed  or 
expanded  to  contain  F  frames  of  data.  This  is  done  by  applying  the  following 
equation: 

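$$T(f) = (1-k)\,R(m) + k\,R(m+1), \qquad f = 1, \ldots, F$$

where R(m), $1 \le m \le M$, are the M frames of the input template, T(f), $1 \le f \le F$, are
the F frames of the output template, and:

$$m = \left\lfloor \frac{(f-1)(M-1)}{F-1} \right\rfloor + 1$$

$$k = \frac{(f-1)(M-1)}{F-1} + (1-m)$$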

k  is  calculated  F  times  and  requires  one  fixed  point  multiplication,  one  fixed 
point  division,  one  fixed  point  addition,  and  four  fixed  point  subtractions.  M  is 
calculated  once  per  word  and  requires  one  fixed  point  floor,  three  fixed  point 
subtractions,  one  fixed  point  multiplication,  one  fixed  point  division,  and  one 
fixed  point  addition.  T(f)  is  calculated  F  times  and  requires  two  fixed  point 
multiplications,  one  fixed  point  addition,  one  fixed  point  subtraction,  and  three 
integer  additions  for  address  calculations.  In  this  paper,  the  value  used  for  F 
will be 40 (frames/word). This represents 596 msec of speech.
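
Linear time warping amounts to resampling the M input frames onto F evenly spaced positions with linear interpolation. The sketch below illustrates this under stated assumptions: the name `linear_time_warp`, the representation of frames as Python lists, and the clamping of R(m+1) at the last frame (where k is zero) are choices made for the example, not details taken from the thesis.

```python
import math


def linear_time_warp(R, F):
    """Compress or expand the M frames in R to F frames (F > 1)."""
    M = len(R)
    T = []
    for f in range(1, F + 1):
        pos = (f - 1) * (M - 1) / (F - 1)   # fractional position in the input
        m = math.floor(pos) + 1             # 1-based index of the left frame
        k = pos + (1 - m)                   # interpolation weight, 0 <= k < 1
        left = R[m - 1]                     # R(m)
        right = R[min(m, M - 1)]            # R(m+1), clamped at the last frame
        T.append([(1 - k) * a + k * b for a, b in zip(left, right)])
    return T
```

For example, warping the three one-element frames `[[0.0], [10.0], [20.0]]` to F = 5 gives frames with values 0, 5, 10, 15, and 20.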

From  here,  the  data  is  passed  to  the  next  stage,  where  a  process  called 
dynamic  time  warping  (DTW)  is  applied  to  each  utterance.  For  speech 
recognition  systems,  reference  patterns  (or  templates)  for  each  word  the  system 
is  to  understand  are  stored  in  memory.  DTW  attempts  to  normalize  time  in 
order  to  make  an  unknown  utterance  match  each  of  the  template  utterances, 
thus  finding  the  minimum  time-normalized  distance  between  the  unknown 
template  and  the  reference  templates.  Fig.  4.6.5  [Yod82]  shows  the  results  of 
dynamic  time  warping.  In  this  case,  each  template  is  represented  by  a  sequence 
of feature vectors. Each feature vector contains the LPC coefficients. Since
linear time warping has been applied to the two utterances, they are the same
length. This reduces the complexity of DTW. The following algorithm [YoS82]
considers two patterns A and B, where A and B are sequences of feature
vectors $a_i$ and $b_j$ for $1 \le i \le I$ and $1 \le j \le J$. The $a_i$ and $b_j$ are vectors containing


LPC  coefficients.  Since  linear  time  warping  has  been  performed  at  the  previous 
level,  I  and  J,  the  number  of  frames  describing  the  incoming  utterance  and  the 
known  word  template,  are  equal.  The  minimum  time-normalized  distance  is 
found  as  shown  in  Fig.  4.6.6.  This  is  accomplished  by  finding  a  path 
connecting  (1,1)  to  (I, J)  such  that  the  cumulative  distance  is  a  weighted  sum  of 
the local distances d(i,j) between the vectors $a_i$ and $b_j$. d(i,j) is defined to be:

$$d(i,j) = \lVert a_i - b_j \rVert^2$$

One  method  to  find  the  cumulative  distance,  g(i,j),  restricts  the  possible  path 
leading  to  a  given  point  to  those  shown  in  the  inset  in  Fig.  4.6.6.  Using  a 
recursive  definition,  g(i,j)  can  be  defined  as  follows  [YoS82]: 

$$g(i,j) = d(i,j) + \min \left\{ \begin{array}{l} g(i-1,j-2) + 2d(i,j-1) \\ g(i-1,j-1) + d(i,j) \\ g(i-2,j-1) + 2d(i-1,j) \end{array} \right.$$

$$g(1,1) = 2d(1,1)$$

The result of the algorithm is the time-normalized distance, g(I,J)/(I + J). Fig.
4.6.7 shows a serial DTW algorithm. For the purposes of this paper, let r,
the amount the algorithm is allowed to "warp" the utterance, be 3 and J, the
number of templates of known words, be 10000. This will give the system a
vocabulary of 1000 words (since speaker independent word recognition requires
ten templates for each word in the vocabulary [RaL79]), at a cost of over 10^8
index operations. By choosing the word corresponding to the minimal distance,
hold = ∞; template = 0;                 /* initialization */
for k = 1 to 10000 {                    /* for each template */
    for j = -1 to 1 {                   /* initialization */
        for i = -1 to 1 {
            g[i][j] = ∞;
            d[i][j] = ∞;
        } /* end i */
    } /* end j */

    for j = 1 to 80 {                   /* for each frame in a and b[k] */
        for i = j-r to j+r {            /* each frame within window */
            if (i < 0) i = 1;           /* force i to be valid */
            if (i > 80) i = j+r+1;
            else {
                d[i][j] = 0;
                for h = 1 to 9 {
                    /* compute 'distance' between
                       frames a[i] and b[k][j] */
                    d[i][j] = d[i][j] +
                              (a[i][h] - b[k][j][h])**2;
                } /* end h */
                g[i][j] = min(g[i-1][j-1] + 2d[i][j],
                              g[i-1][j-2] + 2d[i][j-1] + d[i][j],
                              g[i-2][j-1] + 2d[i-1][j] + d[i][j]);
            } /* end else */
        } /* end i */
    } /* end j */

    D(a,b[k]) = g[80][80];
    if (D(a,b[k]) < hold) {             /* store minimum value */
        hold = D(a,b[k]);
        template = k;
    } /* end if */
} /* end k */

a            - unknown word (UW)
a[i]         - frame i of UW
a[i][h]      - element h of vector describing frame i of UW
b[k]         - reference word k (RWK)
b[k][i]      - frame i of RWK
b[k][i][h]   - element h of vector describing frame i of RWK
D(a,b[k])    - distance between UW and RWK
g(i,j)       - cumulative distance between a and b[k]
hold         - distance number of best fitting reference word
template     - number of best fitting reference word


Fig.  4.6.7  Sample  DTW  algorithm 
this system will have speaker independent accuracies up to 98% [RaL79].
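
The recursion for g(i,j) and the choice of the minimum-distance template can be illustrated with a short Python sketch. It is not the thesis implementation: the function names are hypothetical, the local distance is assumed to be the sum of squared coefficient differences (as in the inner loop of Fig. 4.6.7), and cells outside the window of width r are simply treated as infinitely distant.

```python
import math


def local_distance(a, b):
    """Sum of squared differences between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_distance(A, B, r=3):
    """Time-normalized distance g(I,J)/(I+J) between frame sequences A and B."""
    I, J = len(A), len(B)
    INF = math.inf
    # 1-based tables; cells outside the window keep the value INF.
    d = [[INF] * (J + 1) for _ in range(I + 1)]
    g = [[INF] * (J + 1) for _ in range(I + 1)]
    for i in range(1, I + 1):
        for j in range(max(1, i - r), min(J, i + r) + 1):
            d[i][j] = local_distance(A[i - 1], B[j - 1])
    g[1][1] = 2 * d[1][1]
    for i in range(1, I + 1):
        for j in range(max(1, i - r), min(J, i + r) + 1):
            if i == 1 and j == 1:
                continue
            best = INF
            if i >= 2 and j >= 2:
                best = min(best, g[i - 1][j - 1] + d[i][j])
            if i >= 2 and j >= 3:
                best = min(best, g[i - 1][j - 2] + 2 * d[i][j - 1])
            if i >= 3 and j >= 2:
                best = min(best, g[i - 2][j - 1] + 2 * d[i - 1][j])
            g[i][j] = d[i][j] + best
    return g[I][J] / (I + J)


def recognize(unknown, templates, r=3):
    """Return (index, distance) of the reference template closest to unknown."""
    distances = [dtw_distance(unknown, t, r) for t in templates]
    best = min(range(len(templates)), key=distances.__getitem__)
    return best, distances[best]
```

Because the path weights along any allowed path sum to I + J, two sequences whose local distance is 1 everywhere have a time-normalized distance of exactly 1, which is a convenient sanity check.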


4.7.  Application  of  Theory  to  Scenario 

The  application  of  the  evaluation  categories  to  the  Z-transform  gives  the 
following  information: 


1)  Input:            16-bit integer/sample,
                      6670 samples/second

2)  Calculations:     1 16-bit fetch,
                      1 24-bit fixed point multiplication,
                      1 24-bit fixed point subtraction,
                      1 24-bit fixed point store,
                      2 24-bit fixed point register-register transfers

3)  Range/Accuracy:   +65535 to -63000 / 1

4)  Communications:   None

5)  Memory:           Program + Data + Stack < 2 Kbytes

6)  Parallelism:      MIMD

7)  Algorithm:        Stated above

8)  Output:           6670 32-bit floating point numbers per second


After  the  Z-transform  is  performed  by  the  first  level,  the  data  is  passed 
onto  the  next  level  where  the  windowing  is  performed.  The  computational 
requirements  of  this  level  are  as  follows: 


1)  Input:            6670 24-bit fixed point numbers/second

2)  Calculations:     3 24-bit fixed point multiplications per number,
                      6 24-bit fixed point fetches/stores per number

3)  Range/Accuracy:   ±65535 / .008

4)  Communications:   Nearest Neighbor 200
                      24-bit fixed point numbers

5)  Memory:           Program + Data + Stack < 5 Kbytes

6)  Parallelism:      SIMD/MIMD

7)  Algorithm:        Stated above

8)  Output:           300 16-bit fixed point numbers/14.9 msec.


Note  that  this  level  performs  the  operation  of  dividing  the  data  into  300 
sample  frames  that  are  transmitted  every  100  samples  (14.9  msec).  Every  14.9 
msec,  the  300  sample  frames  are  transmitted  to  the  next  level,  where 
autocorrelation  analysis  is  performed. 
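
The framing performed by this level (a 300-sample frame emitted every 100 samples, so that each sample appears in up to three frames) can be sketched as follows. The function name `make_frames` and its parameter names are assumptions made for illustration only; any window weighting applied to the samples is omitted.

```python
def make_frames(samples, frame_len=300, hop=100):
    """Split a sample sequence into overlapping frames.

    A new frame of frame_len samples starts every hop samples; a trailing
    partial frame is dropped.
    """
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


SAMPLE_RATE_HZ = 6670                            # sampling rate stated in the text
FRAME_PERIOD_MSEC = 100 / SAMPLE_RATE_HZ * 1000  # time between successive frames
```

At 6670 samples/second, 100 samples correspond to roughly 15 msec, in line with the 14.9 msec frame period quoted above.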

The  calculational  requirements  of  the  autocorrelation  analysis  are  as 
follows: 


1)  Input:            300 24-bit fixed point numbers/
                      14.9 milliseconds,
                      300 32-bit fixed point numbers = 1 frame

2)  Calculations:     2773 32-bit fixed point multiplications/frame,
                      2764 32-bit fixed point additions/frame,
                      2782 32-bit fixed point fetches(stores)/frame,
                      2782 integer additions/frame (indexing)

3)  Range/Accuracy:   ±2^24 / 1

4)  Communications:   None

5)  Memory:           Program + Data + Stack < 7 Kbytes

6)  Parallelism:      MIMD

7)  Algorithm:        Stated above

8)  Output:           9 32-bit fixed point numbers/14.9 msec.


The  results  of  the  autocorrelation  analysis  are  used  for  the  LPC  analysis,  where 
the  amount  of  data  is  reduced  from  300  16-bit  fixed  point  numbers 
representing  a  frame  to  9  32-bit  fixed  point  numbers.  The  requirements  of  the 
LPC  analysis  are: 

1)  Input:            9 32-bit fixed point numbers per frame

2)  Calculations:     18 integer initializations,
                      504 integer/index additions/subtractions,
                      72 integer comparisons,
                      8 32-bit fixed point initializations,
                      89 32-bit fixed point assignments/stores,
                      28 32-bit fixed point additions,
                      44 32-bit fixed point subtractions,
                      72 32-bit fixed point multiplications,
                      8 32-bit fixed point divisions,
                      553 32-bit fixed point fetches

3)  Range/Accuracy:   ±2^24 / 0.008

4)  Communications:   None

5)  Memory:           Program + Data + Stack < 7 Kbytes

6)  Parallelism:      MIMD

7)  Algorithm:        Stated above

8)  Output:           8 32-bit fixed point numbers/14.9 msec.


After  the  LPC  analysis  is  performed,  the  a(i)’s  are  then  passed  onto  the 
next  level  where  endpoint  detection  is  performed.  Endpoint  detection  takes 
place  over  several  frames.  The  time  required  for  the  endpoint  detection 
algorithm is a function of the number of frames in a word. On a per word basis
(assuming 80 frames per word, i.e., I = 80 and J = 80), the calculational
requirements of the endpoint detection algorithm are:


1)  Input:            720 32-bit fixed point numbers/1.2 sec.

2)  Calculations:     1520 32-bit fixed point comparisons/word,
                      324 integer increments(decrements)/word,
                      4 integer stores,
                      1280 32-bit fixed point fetches(stores)

3)  Range/Accuracy:   ±2^24 / .008

4)  Communications:   None

5)  Memory:           Program + Data + Stack < 10 Kbytes

6)  Parallelism:      SIMD/MIMD

7)  Algorithm:        Stated above

8)  Output:           640 32-bit fixed point integers/word.


These  figures  represent  maximums  because  it  is  quite  possible  to  go  for 
many  frames  without  receiving  any  input.  In  such  a  case,  it  is  possible  to  just 
require  the  comparisons  with  only  a  memory  update  to  accommodate  the 
incoming  frames. 


From  here  the  words  are  passed  on  to  the  level  that  performs  the  linear 
time  warping.  Assuming  80-frame  words  (which  are  very  long),  the  calculations 
required  for  linear  time  warping  are  as  follows: 


1)  Input:            720 32-bit fixed point numbers/1.2 sec.

2)  Calculations:     1480 32-bit fixed point fetches(stores)/1.2 sec.
                      480 32-bit fixed point additions/1.2 sec.
                      162 32-bit fixed point subtractions/1.2 sec.
                      800 32-bit fixed point multiplications/1.2 sec.
                      40 32-bit fixed point divisions/1.2 sec.
                      40 32-bit fixed point floor operations/1.2 sec.

3)  Range/Accuracy:   ±2^24 / .008

4)  Communications:   Global

5)  Memory:           Program + Data + Stack < 5 Kbytes

6)  Parallelism:      SIMD/MIMD

7)  Algorithm:        Stated above

8)  Output:           720 32-bit fixed point integers per word.


After  the  linear  time  warping  is  performed,  the  data  is  passed  on  to  the  next 
stage,  where  dynamic  time  warping  is  performed  (once  for  each  utterance  in  the 
vocabulary). 

1)  Input:            720 32-bit fixed point numbers/1.2 sec.

2)  Calculations:     6.8M index variable assignments/1.0 seconds
                      0.1M index variable additions
                      66.1M index variable additions (+1)
                      67.3M index variable conditional branches
                      132.7M address calculations
                      105.5M fixed point additions
                      5.8M fixed point assignments
                      11.3M fixed point conditional branches
                      60.7M fixed point multiplications
                      60.7M fixed point subtractions

3)  Range/Accuracy:   ±2^24 / ±1

4)  Communications:   Global: capable of recursive doubling [Sto79]

5)  Memory:           Program + Stack < 10 Kbytes
                      Data (14.5/N) + 0.01 Mbytes per processor for reference
                      (template) and incoming utterance storage.
                      Note: One copy of the program is required per
                      processor for an MIMD machine; one copy is required
                      for the control unit in an SIMD machine.

6)  Parallelism:      MIMD

                      $$\text{Speedup} = \frac{T}{\dfrac{T}{N} + (\log_2 N)\,(IC + 2 \times NO)}$$

                      where a single processor takes time T, IC is the time
                      for an integer comparison, and NO is the time for a
                      network operation.

7)  Algorithm:        Stated above.

8)  Output:           1 45-character word


Consider  the  derivation  of  the  last  set  of  nine  evaluation  categories.  These 
nine  evaluation  categories  represent  an  analysis  of  the  algorithm.  Evaluation 
category  2  is  directly  determinable  from  the  algorithm.  The  range  and 
accuracy  is  determinable  from  the  application.  [RaL79]  states  that  15  bits  is  a 
reasonable  wordwidth  when  the  sampling  rate  is  6.67  KHz,  as  it  is  for  this  task. 
To  apply  a  parallel  machine  to  this  algorithm,  each  PE  would  need  to  execute 
this algorithm on its own portion of the template database, computing a local
D(A,B).  Recursive  doubling  [Sto79]  would  then  be  used  to  combine  the  results; 
i.e.,  the  word  associated  with  the  smallest  D(A,B)  is  the  chosen  word.  This 
requires  2  log2N  transfers  for  the  D(A,B)’s  and  the  identifiers  for  their 
associated  words. 
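
The combining step can be illustrated with a small simulation of recursive doubling. This sketch assumes that N is a power of two and that each PE exchanges its current best (distance, word identifier) pair with the PE whose index differs in one bit; the function name and data layout are hypothetical and are not taken from [Sto79] or from the thesis.

```python
def recursive_doubling_min(local_results):
    """Combine per-PE (distance, word_id) pairs so every PE holds the minimum.

    Returns the final per-PE values and the number of exchange steps used.
    """
    n = len(local_results)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("number of PEs must be a power of two")
    values = list(local_results)
    steps = 0
    stride = 1
    while stride < n:
        # In one step, PE p and PE p XOR stride exchange values and keep the smaller.
        values = [min(values[p], values[p ^ stride]) for p in range(n)]
        stride *= 2
        steps += 1
    return values, steps
```

Each of the log2 N steps moves two items per PE (the distance and the identifier of its word), which corresponds to the 2 log2 N transfers counted above.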

The  amount  of  memory  is  expressed  as  a  function  of  N,  the  number  of 
processors.  A  “C”  language  program  was  coded  and  compiled  to  estimate  the 
program  size.  The  DD  is  small,  so  either  SIMD  or  MIMD  parallelism  can  be 
applied  to  the  program;  however,  the  maximum  parallelism  is  10,000 
processors,  assuming  each  PE  executes  the  algorithm  for  one  or  more 
templates.  Application  of  N  processors  will  yield  the  speedup  shown  in  (6). 
The  output  of  this  system  is  one  word.  It  is  imperative  that  the  system  keep 
up  with  the  input;  however,  it  is  desirable  to  do  such  with  a  minimal  cost. 


The  number  of  each  calculation  can  be  multiplied  by  the  single-operand 
execution  times  of  the  tuples  for  each  processor  in  the  database.  The  sum  of 
the  products  yields  an  approximate  worst-case  execution  time  for  a  single  copy 
of  each  processor  in  the  database  to  perform  this  algorithm.  Actual  execution 
time could be better due to clever software or special hardware functions. For
example, software can be written to ignore redundant calculations. Also, by
applying  pipeline  analysis  techniques  to  this  algorithm  and  using  structural 
information  about  each  processor,  such  as  functional  overlap,  stages  in 
processing  pipelines,  and  the  multiplicity  of  units,  a  more  precise 
approximation  of  the  single  processor  execution  times  can  be  obtained. 

Based  on  the  desired  response  time,  additional  processors  of  the  same  type 
are  repetitively  added  until  a  level  composed  of  such  processors  could  meet  the 
time  requirements.  The  number  of  processors  is  then  multiplied  by  the  cost  of 
the  associated  hardware.  To  this  amount,  the  price  of  other  devices,  such  as 
memory and inter-processor communications links, is added to approximate the
cost of the processing hardware involved. The processor used for the
design will be chosen based on the least expensive hardware.

Consider  the  application  of  a  Motorola  68000  to  the  above  task.  The 
tuples enumerating the operations and their respective times contain over 1000
instructions; a partial list is included for brevity:

{add r,#; add r1,r2; add (a)+,r; cond. branch; mov r,#; mov r,(a); mov #,(a); mul
r1,r2; mul (a)+,r; sub r,#; sub r1,r2; sub (a)+,r}

where r stands for register, # stands for immediate, (a) stands for memory
location stored in register "a", and (a)+ stands for memory location stored in
register "a" followed by incrementing "a".


The  tuple  describing  the  timings  (in  cycles)  is: 

{8,4,8, 10(true)/8(false),8, 12,12,70,74,8,4,8} 

The 68000 has no functional overlap or pipelining other than a five stage
instruction decoder. These tuples will be omitted. A 68000 has no special
address calculation hardware, so an address calculation requires loading a
register,  multiplying  by  a  memory  location,  and  the  addition  of  two  memory 
locations.  Assuming  that  the  index  variables  are  stored  in  registers  and  that 
fixed  point  numbers  are  stored  in  memory,  a  12.5  MHz  68000  would  take  1579 
seconds  to  perform  dynamic  time  warping  on  a  single  word.  Using  a  multistage 
cube  network  that  takes  1.0  msec  for  two  transfers,  1600  processors  in  MIMD 
mode  would  take  .998  seconds  to  perform  dynamic  time  warping.  (A  thorough 
analysis  should  consider  the  overlap  of  CU  and  PE  operations  in  SIMD  mode; 
e.g.,  address  calculations).  Dynamic  time  warping  is  normally  done  with  fewer 
than  100  reference  templates  because  of  its  great  computational  complexity. 
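
The 1600-processor figure can be checked with a few lines of Python, assuming the parallel execution time has the form T/N + (log2 N)(IC + 2 x NO), i.e., the denominator of the speedup expression in evaluation category 6. The integer comparison time used below (4 cycles at 12.5 MHz) is an assumption made for this example; the thesis does not state the value it used, and the result is insensitive to it.

```python
import math


def parallel_time(T, N, IC, two_NO):
    """Estimated time on N processors: T/N + (log2 N)(IC + 2 x NO)."""
    return T / N + math.log2(N) * (IC + two_NO)


def speedup(T, N, IC, two_NO):
    """Ratio of single-processor time to N-processor time."""
    return T / parallel_time(T, N, IC, two_NO)


T_SINGLE = 1579.0        # seconds for one 12.5 MHz 68000 (from the text)
TWO_NO = 1.0e-3          # seconds for two network transfers (from the text)
IC_ASSUMED = 4 / 12.5e6  # seconds per integer comparison (assumed: 4 cycles)

t_1600 = parallel_time(T_SINGLE, 1600, IC_ASSUMED, TWO_NO)
s_1600 = speedup(T_SINGLE, 1600, IC_ASSUMED, TWO_NO)
```

Under these assumptions `t_1600` evaluates to just under one second, consistent with the .998 second figure, and the speedup is somewhat below the processor count because of the combining step.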

Such an analysis would be required for each processor in the database.
Then,  an  actual  implementation  of  the  above  approach  would  consider 
simulating  the  algorithm  on  the  various  processors  to  obtain  a  more  accurate 
timing  estimation.  Finally,  if  no  processor  in  the  database  could  be  used  to 
implement  this  algorithm,  the  layer  would  need  to  be  broken  down  into  sub 
layers,  each  of  which  would  be  analyzed  with  the  proposed  techniques. 
4.8. Conclusions

Using the above nine categories, an algorithm can be analyzed according
to the requirements it places on a system. Hardware built to handle these
needs efficiently will be able to process the algorithm effectively. If many
hardware  components  are  analyzed  and  categorized  according  to  their  abilities 
and  processing  times,  a  database  containing  information  about  each  processor 
can  be  built.  By  mapping  the  organization  of  each  level  in  a  multi-level  design, 
computers  can  be  used  to  design  systems  for  specific  needs  of  algorithms.  Thus 
automated  design  of  special  purpose  processors  can  be  achieved. 

In  summary,  this  was  a  preliminary  study  of  how  to  partially  automate 
and  model  the  design  of  special  purpose  systems.  Categories  of  hardware 
analysis were presented. Their relationship to the hardware and their
dependence on the software were discussed. An application of the theory to a
software  scenario  performing  speaker  independent  isolated  word  recognition 
was  presented.  Finally,  the  computational  requirements  of  the  scenario  were 
presented.  By  bridging  the  gap  between  hardware  and  software,  automated 
special  purpose  machine  design  comes  closer  to  being  a  reality. 


CHAPTER  5 

ASYNCHRONOUS  AND  SYNCHRONOUS 
SYSTEMS  ADVANTAGES  AND 
DISADVANTAGES 

5.1.  Introduction 

Chapter  4  introduced  a  scheme  for  modeling  the  hardware  requirements  of 
a  layer.  It  also  proposed  a  concise  scheme  for  modeling  the  capabilities  of  a 
computational  device.  Finally,  it  showed  a  method  of  going  from  the  hardware 
requirements  of  a  task  to  the  computational  device.  This  was  all  done  with  a 
real-time  system  in  mind.  The  type  of  system  considered  in  Chapter  4  was  a 
synchronous  macro-pipelined  system  with  potential  parallelism  at  each  level. 
This  chapter  looks  at  the  performance  of  an  asynchronous  macro-pipelined 
system,  a  synchronous  macro-pipelined  system  with  triple-buffers  between 
levels,  and  a  synchronous  macro-pipelined  system  with  double-buffers  between 
levels.  It  is  the  goal  of  this  chapter  to  show  the  strong  and  weak  points  of  each 
of  these  schemes,  along  with  showing  in  which  situations  each  of  these  schemes 
is  most  applicable. 

Before  considering  the  use  of  asynchronous  stages  in  the  proposed  macro- 
pipelined  architectures,  analysis  techniques  to  determine  inter-level  buffer  size, 
expected  process  wait  time,  and  the  likelihood  of  buffer  overflow,  are  required. 
If  analysis  techniques  cannot  be  developed,  the  use  of  asynchronous  stages  will 
be complicated because no analysis techniques short of simulation will provide
meaningful  results.  It  is  the  goal  of  this  chapter  to  determine  whether  it  is 
possible  to  derive  the  following  parameters  about  a  (possibly  parallel)  level: 

I    Probability of receiving v jobs by some time t (Pv(t))

II   The expected average input queue length (Q) for a given layer

Pv(t)  is  useful  in  determining  the  most  likely  time  at  which  v  jobs  will 
arrive  at  a  given  level,  which  is  receiving  input  (which  may  include  feedback 
from  later  levels)  from  r  sources.  Integrating  tPv(t)  with  respect  to  time  will 
yield  the  expected  time  at  which  v  arrivals  will  occur.  Within  some  tolerance 
(e.g.,  ±  a  standard  deviation),  this  is  required  to  determine  the  required 
throughput  of  a  given  level.  Q  is  required  to  determine  the  time  that  a  job 
spends  “waiting”  to  get  processed.  This  time  must  be  added  to  the  total 
computation  time  for  a  job  to  determine  the  total  time  required  to  complete  a 
job. 

In  the  previous  discussion,  each  level  was  allowed  to  process  only  one  data 
set  at  a  time.  This  was  a  restriction  imposed  by  the  synchronous  nature  of  the 
system.  When  the  levels  of  a  system  are  asynchronous,  this  restriction  could  be 
removed.  For  example,  a  level  could  contain  multiple  processors,  each  working 
on  a  different  data  set.  For  the  purposes  of  this  study,  however,  this  restriction 
will  still  be  imposed.  As  shown  in  Fig.  5.1.1  (a),  it  will  be  assumed  that  all 
processors  in  a  given  level  work  together  to  process  a  single  data  set.  There  is 
a  single  input  queue  for  each  level.  This  form  of  replication  corresponds  to 
either the SIMD or MIMD parallelism discussed in the previous section. Only a
single result is completed at a time by a level. Outgoing jobs are queued (if
necessary) in the input queue for the next level.

There  are  other  types  of  parallelism,  such  as  the  multiprocessors  with 
multiple  data  sets  mentioned  above,  which  will  not  be  considered  here. 
Instead,  their  analysis  will  be  left  as  future  work. 

Fig.  5.1.1  (b)  shows  two  asynchronous  systems  with  feedback.  For  the 
purposes  of  this  research,  feedback  is  defined  to  be  any  data  set  in  the  input 
queue for a level i that did not come from level i-1. By this definition,
feedforward (from levels other than i-1) is also treated as feedback. Here, it
is assumed that feedback data sets can arrive asynchronously and are normal
data sets as far as size and processing requirements are concerned. Feedback
may be required when a data
set needs further reprocessing, e.g., processing with different parameters
because  it  is  later  found  that  some  criterion  is  not  met.  Feedforward  may  be 
used  when  a  particular  data  set  does  not  need  to  be  processed  by  a  specific 
level  or  levels.  Synchronous  systems,  by  their  nature,  cannot  have  feedback. 

Initially,  four  assumptions  will  be  made.  They  are: 

(1)  Two input data sets cannot arrive at a given level simultaneously; i.e.,
     feedback is not allowed. (If there is no level-to-level feedback, it is
     impossible for multiple data sets to arrive at a given level
     simultaneously.) This restriction will be removed later.

(2)  At a given level, the arrival of a particular data set is independent of the
     arrival of any other data set. Thus, the arrival of data sets to be
     processed by a level can be treated like events in a Poisson random
     process.

(3)  At a given level, the average arrival rate of data sets to be processed is
     $\lambda_{ar}$.

(4)  The probability of an arrival at a given level during a given time
     interval is a function of the interval duration, not the beginning time of
     the interval. For a very small time interval $\Delta t$, the probability of an
     arrival is $\lambda_{ar}\Delta t$.

Section  5.2  discusses  the  determination  of  Pv(t)  for  a  single  input  stream, 
single  processing  stream  system;  i.e.,  no  feedback.  The  information  presented 
in  Section  5.2  represents  a  derivation  of  the  results  presented  in  [Ful75].  It  is 
the  purpose  of  this  section  to  provide  necessary  background  information  to 
clarify  the  discussion  of  topics  appearing  elsewhere  in  this  chapter.  Section  5.2 
also states the results of the theory for a multiple input stream case; i.e.,
feedback  is  allowed.  Section  5.3  continues  the  derivation  started  in  Section  5.2 
to  determine  the  expected  size  of  a  level-level  queue. 

Based on the previous sections, Section 5.4 compares the performance of
an asynchronous system with the performance of both double- and
triple-buffered synchronous systems. In Section 5.4.3 the performance of both
double- and triple-buffered synchronous systems is analyzed for two level
systems  where  the  response  time  of  the  first  level  is  fixed  and  the  response  time 
of  the  second  level  is  either  a  uniform  random  variable  or  a  Gaussian  random 
variable.  Section  5.4.4  applies  the  techniques  presented  in  Section  5.3  to  derive 
the  expected  throughput  and  response  time  of  an  asynchronous  system. 
Section  5.4.5  contrasts  the  performance  of  the  two  synchronous  systems  and 
the  asynchronous  system  when  the  response  times  of  the  two  levels  are  random 
variables.  Q  for  an  asynchronous  system  is  discussed  in  Section  5.4.6.  Section 
5.4.7  contains  a  discussion  of  the  applications  of  double-  and  triple-buffering 


systems.  The  advantages  and  disadvantages  of  synchronous  systems  and 
asynchronous  systems  are  summarized  in  Section  5.4.8. 

Sections  5.2,  5.3,  and  5.4  all  deal  with  the  theoretical  expectations  of  the 
system  throughput.  To  verify  the  results  presented  in  these  sections,  Section 
5.5  presents  results  obtained  from  simulation. 


5.2.  Determination  of  Pv(t)  For  a  Single  Input/Processing  Stream 

The following derivation is similar to that in [Ful75]. Setting $\Delta t$ as the
time interval under consideration, the probability of a data set arriving is:

$$P_{\ge 1}(\Delta t) = \lambda_{ar}\,\Delta t + P_{\ge 2}(\Delta t)$$

where $P_{\ge 2}(\Delta t)$ is the probability of at least two data sets arriving. $P_v(\Delta t)$ for
v > 1 is zero if the previous layer produces at most one result at a time, if there
is no feedback, and if $\Delta t$ is short (i.e., $\Delta t \times \lambda_{ar} \ll 1$).

The case for v > 0 arrivals during a time interval $t + \Delta t$ is:

$$P_v(t + \Delta t) = P_{v-1}(t)\,P_1(\Delta t) + P_v(t)\,P_0(\Delta t)$$

During one time epoch ($\Delta t$), at most one arrival can occur. Thus, either zero
jobs or one job can arrive during the interval $\Delta t$. The derivative $P_v'(t)$ can be
obtained through the fundamental definition of differentiation:

$$P_v'(t) = \lim_{\Delta t \to 0} \frac{P_v(t + \Delta t) - P_v(t)}{\Delta t}$$

Since $P_1(\Delta t) = \lambda_{ar}\Delta t$, $P_0(\Delta t) = 1 - P_1(\Delta t) = 1 - \lambda_{ar}\Delta t$. Thus,

$$P_v'(t) = \lambda_{ar} P_{v-1}(t) - \lambda_{ar} P_v(t) \qquad (v > 0)$$

Taking the Laplace transform of this equation (assuming that
$\lim_{\Delta t \to 0} P_{v \ne 0}(\Delta t) = 0$), yields:

$$P_v(s) = \frac{1}{s + \lambda_{ar}} \prod_{i=1}^{v} \frac{\lambda_{ar}}{s + \lambda_{ar}}$$

Taking the inverse Laplace transform yields:

$$P_v(t) = \frac{(\lambda_{ar} t)^v}{v!}\, e^{-\lambda_{ar} t}$$

The  application  of  the  above  equation  is  limited  to  a  system  with  a  single 
input  stream  capable  of  sending  one  job  at  a  given  time  and  a  single  processing 
stream  producing  a  single  result.  This  type  of  analysis  makes  it  possible  to 
determine  the  probability  of  a  level  receiving  a  given  number  of  arrivals  by  a 
certain  time  t. 

By applying the results in [Cin75], Pv for r independent Markovian
streams is:

$$P_v(t) = \frac{(\lambda_L t)^v}{v!}\, e^{-\lambda_L t}$$

where $\lambda_L = \sum_{i=1}^{r} \lambda_i$ and $\lambda_i$ is the arrival rate for stream i.
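
As a numerical illustration, the Poisson expression for the probability of v arrivals by time t, with the rates of r independent streams summed, can be evaluated directly. The function name `p_arrivals` and the example rates are assumptions made for this sketch.

```python
import math


def p_arrivals(v, t, rates):
    """Probability of exactly v arrivals by time t for independent Poisson streams.

    rates holds the arrival rate of each input stream; a single-stream level
    is the special case of a one-element list.
    """
    total_rate = sum(rates)
    mean = total_rate * t
    return mean ** v / math.factorial(v) * math.exp(-mean)


# Example: two streams with rates 0.5 and 1.5 jobs per unit time, observed for t = 2.
example = [p_arrivals(v, 2.0, [0.5, 1.5]) for v in range(60)]
```

The probabilities over all v sum to one, and the expected number of arrivals by time t equals the total arrival rate multiplied by t.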


5.3.  The  Expected  Queue  Size  of  an  Asynchronous  System 


The above results can be used to calculate the Expected Interarrival
Time (EIT) as follows:

$$P[\text{average interarrival time} < t] = 1 - P_0(t)$$

Taking the derivative yields:

$$p[\text{average interarrival time} = t] = -p_0(t)$$

where $p_v(t) = P_v'(t)$ and

$$P_0(t) = e^{-\sum_{i=1}^{r} \lambda_i t}, \qquad p_0(t) = -\left(\sum_{i=1}^{r} \lambda_i\right) e^{-\sum_{i=1}^{r} \lambda_i t}$$

Thus,

$$EIT = \left(\sum_{i=1}^{r} \lambda_i\right) \int_{t=0}^{t=\infty} t\, e^{-\sum_{i=1}^{r} \lambda_i t}\, dt = \frac{1}{\sum_{i=1}^{r} \lambda_i}$$
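
The closed form for the EIT, the reciprocal of the summed arrival rates, can be checked numerically by integrating t multiplied by the interarrival density. The sketch below uses a plain trapezoidal rule truncated at 50 mean interarrival times; the function name, the truncation point, and the step count are arbitrary choices made for the illustration.

```python
import math


def expected_interarrival_time(rates, steps=200000):
    """Numerically integrate t * L * exp(-L * t) over t >= 0, where L = sum(rates)."""
    total_rate = sum(rates)
    upper = 50.0 / total_rate          # the integrand is negligible beyond this point
    h = upper / steps
    total = 0.0
    for n in range(steps + 1):
        t = n * h
        weight = 0.5 if n in (0, steps) else 1.0   # trapezoidal end-point weights
        total += weight * t * total_rate * math.exp(-total_rate * t)
    return total * h
```

For two streams with rates 2 and 3 jobs per unit time the numerical result agrees with 1/(2 + 3) = 0.2 to well within the integration error.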


Service of the arriving jobs is less complex because it is a valid assumption
that only one job may be removed from the queue at a time. Because data sets
may be related, the servicing of the data sets is not necessarily Markovian. By
defining $\lambda_L$ to be 1/EIT for jobs arriving at level L, $\mu_L$ to be the average
throughput of level L, and C to be (standard deviation of service time) /
(expected service time), the Pollaczek-Khinchine formula [Ful75], along with
work from [CoM67], can be applied to determine the expected queue length as:

$$\bar{Q} = \frac{2\lambda_L(\mu_L - \lambda_L) + \lambda_L^2\,(C^2 + 1)}{2\mu_L(\mu_L - \lambda_L)}$$

This is the expected queue length of an M/G/1 queuing structure
(Markovian arrival process, General distribution service structure, 1 processor
serving queue).
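
The Pollaczek-Khinchine expression is easy to evaluate for candidate arrival and service rates. The sketch below assumes the form Q = [2 lambda (mu - lambda) + lambda^2 (C^2 + 1)] / [2 mu (mu - lambda)], which counts the job in service as well as those waiting; the function names are hypothetical, and the use of Little's law to convert the queue length to a time is an addition made for this example rather than a step taken in the text.

```python
def expected_queue_length(arrival_rate, service_rate, c):
    """Expected number of jobs at an M/G/1 level.

    arrival_rate: 1/EIT; service_rate: average throughput of the level;
    c: standard deviation of service time divided by expected service time.
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable unless arrival_rate < service_rate")
    slack = service_rate - arrival_rate
    numerator = 2 * arrival_rate * slack + arrival_rate ** 2 * (c ** 2 + 1)
    return numerator / (2 * service_rate * slack)


def expected_time_at_level(arrival_rate, service_rate, c):
    """Mean time a job spends at the level (Little's law: Q / arrival_rate)."""
    return expected_queue_length(arrival_rate, service_rate, c) / arrival_rate
```

With an arrival rate of 1 and a service rate of 2, exponential service (C = 1) gives an expected length of 1, while fixed service times (C = 0) give 0.75, showing how service-time variability lengthens the queue.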

5.4.  A  Comparison  of  Synchronous  and  Asynchronous  Systems 
5.4.1.  Introduction 

Since  Pv(t)  and  Q  can  be  calculated,  the  throughput  and  response  time  of 
a  system  with  asynchronous  levels  can  be  compared  with  a  system  whose  levels 
are  running  in  synchrony.  Several  metrics  must  be  considered  to  perform  this 
comparison.  While  such  metrics  as  expected  queue  size,  wait  time  (in  the 
queue),  and  expected  run  time  all  have  a  meaning  for  an  asynchronous  system, 
their  use  for  a  synchronous  system  is  limited.  Worst  case  speed  in  an 
asynchronous system reflects itself in the expected size of the buffer between
two  asynchronous  levels,  but  does  not  bear  the  same  significance  in  a 
synchronous  system,  where  it  is  used  to  calculate  the  run  time  of  a  system. 


5.4.2.  Initial  System  Models  --  Three  Potential  Architectural  Schemes 

Consider the proposed systems shown in Fig. 5.4.2.1. Each of the
proposed systems contains two levels. This model can be extended to systems
of multiple (>2) levels, by repetitively analyzing the system in terms of two
level pieces. This is an iterative process. For example: for an L level system,
all levels 2i and 2i+1 ($i < \frac{L}{2}$) would be analyzed as two level systems. Then,
the statistics for these systems (consisting of two levels) would be combined in
groups  of  two.  The  resulting  analysis  would  then  parameterize  the  performance 
of  the  four  level  “systems.”  This  process  can  be  repeated  until  there  is  one  set 
of  parameters  to  describe  the  throughput  of  an  entire  system.  Because  of  the 
simplicity  of  analysis  and  applicability  of  the  analysis,  only  two  level  systems 
are  considered  here. 

The first system in Fig. 5.4.2.1 is a synchronous double-buffered system,
the second a synchronous triple-buffered system (as discussed in Section 4.5),
and the third an asynchronous system. It will be assumed that both of the
synchronous  systems  are  of  the  type  where  both  levels  report  to  an  arbitrator 
when  they  have  completed  processing  (the  first  level  to  report  waits  until  the 
last  level  reports).  It  is  the  goal  of  this  discussion  to  relate  the  response  time, 
throughput,  and  memory  requirements  of  the  three  types  of  architectures.  To 
this  end,  the  discussion  will  assume  that  the  first  level  can  perform  its 
calculations  in  a  fixed  amount  of  time  (this  restriction  will  be  removed  later). 
It  will  be  assumed  that  the  exact  execution  time  for  the  second  level  is 
unknown,  but  that  it  can  be  described  probabilistically.  This  is  similar  to  the 
earlier discussion about the isolated speech recognition system, in which the
fourth level required a fixed computational time and the fifth level required a
variable  time. 


5.4.3.  Analysis  of  Synchronous  Models  with  Two  Probabilistic  Models 

If $t_i$ is the actual time that level i requires to process a given data set and
$pr_i(t)$ is the probability that level i will process any data set in time t, the
expected processing time (EPT) of each level for the synchronous systems can be
defined as follows:

$$EPT = \int_{t=0}^{t=t_1} t_1\, pr_2(t)\, dt + \int_{t=t_1}^{t=\infty} t\, pr_2(t)\, dt$$

Since the system is running in synchrony, the faster level must wait for the
slower level to respond before its processing can continue. The addend (first
term in the sum) represents the time that the system will spend when the
second level responds more quickly than the first. Here, the response time of
the system is $t_1$. The probability that the response time of the system is $t_1$ is
equal to the sum of the probabilities of all cases where $t_2$ is less than $t_1$, hence
the integral.

The augend (second term in the sum) results from the second level
responding more slowly than the first. In this case, the response time of the
second level dictates the response time of the entire system; thus the integral
of t times $pr_2(t)$ is the expected response time of the system when the second
level of the system responds more slowly than the first level.


In a synchronous environment, the minimum time that a data set can
spend at a given level is $t_1$. The processing time for data set D is:

$$\max[t_1(\text{for } D),\, t_2(\text{for } D-1)] + \max[t_1(\text{for } D+1),\, t_2(\text{for } D)]$$

For the double-buffered system, the expected system response time (SRT)
is: 2 x EPT. In general, this is $N_L$ x EPT, where $N_L$ is the number of levels in
the system. The throughput of this system is 1/EPT. In contrast, the
triple-buffered system requires time EPT to transfer the data set from one level
to the next, thus SRT of the triple-buffered system is: $2N_L$ x EPT (one time unit
is required to load data into a level and one time unit is required to process
the data). For the two level case considered here, this is: 4 x EPT. The
throughput of the triple-buffered system is the same as the double-buffered
system.

Analysis of the cases where the levels are running in synchrony (i.e., all
levels must complete their present data set before any can go on to the next
data set) can be obtained by applying this (previously mentioned) equation:

$$EPT = \int_{t=0}^{t=t_1} t_1\, pr_2(t)\, dt + \int_{t=t_1}^{t=\infty} t\, pr_2(t)\, dt$$

The addend of EPT is evaluated as follows (where level two is Gaussian with
mean response time and standard deviation of $\bar{t}_2$ and $\sigma_2 = C\bar{t}_2$ respectively):

$$\int_{t=0}^{t=t_1} t_1\, pr_2(t)\, dt = t_1\, \Phi\left(\frac{t_1 - \bar{t}_2}{C\bar{t}_2}\right)$$

$\Phi$ is the Gaussian probability function with mean 0 and standard deviation 1
[Pap65]. $\sigma_2$ was set to $C\bar{t}_2$ so that the results could be expressed in terms of $t_1$.
For this last equation to hold, the quantity: $1 - \Phi\left(\frac{1}{C}\right)$ must be zero. For
values of C larger than 0.4, a Gaussian distribution function would require
some modifications (e.g., a $\delta$ function for $pr_2(0)$) to be valid. When $\bar{t}_2 = bt_1$,
this equation simplifies to:

$$t_1\, \Phi\left(\frac{1 - b}{Cb}\right)$$

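The augend of EPT is evaluated as follows:

$$\int_{t=t_1}^{t=\infty} t\, pr_2(t)\, dt$$

by defining $u = t - \bar{t}_2$ (du = dt), this simplifies to:

$$\frac{\sigma_2}{\sqrt{2\pi}} \exp\left[\frac{-(t_1 - \bar{t}_2)^2}{2\sigma_2^2}\right] + \bar{t}_2\left[1 - \Phi\left(\frac{t_1 - \bar{t}_2}{\sigma_2}\right)\right]$$

Allowing $\bar{t}_2 = bt_1$ yields:

$$\frac{Cbt_1}{\sqrt{2\pi}} \exp\left[\frac{-(1 - b)^2}{2(Cb)^2}\right] + bt_1\left[1 - \Phi\left(\frac{1 - b}{Cb}\right)\right]$$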


The resultant equation for EPT (for a Gaussian time distribution) is as follows:

$$EPT = t_1(1 - b)\, \Phi\left(\frac{1 - b}{Cb}\right) + \frac{Cbt_1}{\sqrt{2\pi}} \exp\left[\frac{-(1 - b)^2}{2(Cb)^2}\right] + bt_1$$

Here,

$$SRT(\text{double buffered}) = 2 \times EPT$$

$$SRT(\text{triple buffered}) = 4 \times EPT$$

and the expected system throughput (ST)

$$ST = \frac{1}{EPT}$$
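The Gaussian case can be evaluated numerically with the sketch below. It
assumes the closed form obtained by integrating the two terms of EPT for a
Gaussian second level with mean b*t1 and standard deviation C*b*t1, and, like
that closed form, it ignores the probability mass a Gaussian places on negative
times (which becomes noticeable for larger C). All function names are
hypothetical.

```python
import math

def std_normal_cdf(z):
    """Gaussian probability function with mean 0 and standard deviation 1."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ept_gaussian(t1, b, c):
    """Expected processing time when level 1 takes t1 and level 2 is Gaussian."""
    z = (1.0 - b) / (c * b)      # (t1 - mean of t2) / (std dev of t2)
    sigma = c * b * t1           # standard deviation of t2
    return (t1 * (1.0 - b) * std_normal_cdf(z)
            + sigma / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * z * z)
            + b * t1)

def synchronous_metrics(ept):
    """SRT for the double- and triple-buffered two level systems, and ST."""
    return {"srt_double": 2.0 * ept, "srt_triple": 4.0 * ept, "st": 1.0 / ept}
```

For instance, synchronous_metrics(ept_gaussian(1.0, 0.75, 0.75)) gives the
response times and throughput in units of t1.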


Table 5.4.3.1 shows the effect of b and C on SRT and the system
throughput ST when the processing time of level two can be represented by a
Gaussian distribution function. In addition, Table 5.4.3.1 shows that the
greater the probability that $t_2$ is greater than $t_1$, the lower the throughput.
Further, a triple-buffered system will have the same throughput as a
double-buffered system, but the triple-buffered system will require one extra
delay for each level in the system.

Consider the case where the response time of level two can be described by
a uniform distribution function. (For this discussion, it is assumed that it is
possible for $t_2$ to be larger than $t_1$. If this is not the case, EPT = $t_1$.) Again, let
$\bar{t}_2 = bt_1$ and $\sigma_2 = C\bar{t}_2$. If $t_2(max)$ and $t_2(min)$ are the largest and smallest
response times respectively, then $pr_2(t) = \frac{1}{t_2(max) - t_2(min)}$. From [Pap65], it
can be shown that:

$$\sigma_2 = \frac{t_2(max) - t_2(min)}{\sqrt{12}} = C\bar{t}_2$$

Given the above, the equalities:

$$t_2(max) = \bar{t}_2 + \frac{t_2(max) - t_2(min)}{2}$$

$$t_2(min) = \bar{t}_2 - \frac{t_2(max) - t_2(min)}{2}$$

and $\bar{t}_2 = bt_1$, it can be shown that:

$$t_2(max) = t_1(b + \sqrt{3}Cb)$$

$$t_2(min) = t_1(b - \sqrt{3}Cb)$$

Since only non-negative values of time are allowed, $C \le \frac{1}{\sqrt{3}}$. The EPT can be
determined through the following derivation.

$$EPT = \int_{t=t_2(min)}^{t_1} t_1\, pr_2(t)\, dt + \int_{t=t_1}^{t=t_2(max)} t\, pr_2(t)\, dt$$

$$= \int_{t=t_1(b - \sqrt{3}Cb)}^{t_1} t_1\, pr_2(t)\, dt + \int_{t=t_1}^{t=t_1(b + \sqrt{3}Cb)} t\, pr_2(t)\, dt$$


Completing the integration and using the fact that
$PR_2(t) = \frac{t - t_2(min)}{t_2(max) - t_2(min)}$
($PR_2(t)$ is the probability that the response time of level two is less than or
equal to t) yields:

$$EPT = \frac{t_1(1 - b + \sqrt{3}Cb)}{\sqrt{12}Cb} + \frac{(b^2 + \sqrt{12}Cb^2 + 3C^2b^2 - 1)t_1}{2\sqrt{12}Cb}$$

This simplifies to

$$EPT = \frac{(1 - 2b + 2\sqrt{3}Cb + b^2 + 2\sqrt{3}Cb^2 + 3C^2b^2)t_1}{4\sqrt{3}Cb}$$
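The uniform case can be checked with a short sketch that evaluates the closed
form and compares it against a Monte Carlo estimate of the mean of
max(t1, t2). The sketch assumes that t1 lies between the smallest and largest
values of t2, which is the case the derivation covers; the function names, the
sample count, and the seed are hypothetical choices.

```python
import math
import random

SQRT3 = math.sqrt(3.0)

def uniform_bounds(t1, b, c):
    """Smallest and largest t2 for mean b*t1 and standard deviation C*b*t1."""
    half_width = SQRT3 * c * b * t1
    return b * t1 - half_width, b * t1 + half_width

def ept_uniform(t1, b, c):
    """Closed-form EPT when level 1 takes t1 and level 2 is uniform."""
    lo, hi = uniform_bounds(t1, b, c)
    if not (lo <= t1 <= hi):
        raise ValueError("closed form assumes t2(min) <= t1 <= t2(max)")
    num = (1.0 - 2.0 * b + 2.0 * SQRT3 * c * b + b * b
           + 2.0 * SQRT3 * c * b * b + 3.0 * c * c * b * b)
    return num * t1 / (4.0 * SQRT3 * c * b)

def ept_uniform_mc(t1, b, c, n=100_000, seed=1):
    """Monte Carlo estimate: the system waits for the slower of the two levels."""
    rng = random.Random(seed)
    lo, hi = uniform_bounds(t1, b, c)
    return sum(max(t1, rng.uniform(lo, hi)) for _ in range(n)) / n
```

The two functions should agree to within sampling error, which gives a quick
sanity check on the algebra for any admissible b and C.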


Table 5.4.3.2 shows the expected response times and throughput of systems
whose response times can be modeled by a uniformly distributed random variable.


5.4.4.  Analysis  of  an  Asynchronous  System  —  Two  Probabilistic 
Models 

Now, consider the case where the levels operate asynchronously. $t_1$ is
assumed to be constant, as in the synchronous case. If $t_2$ can never exceed $t_1$,
EPT is $t_1$. For all such cases, a queue of length 1 is sufficient. In general, if
$\lambda = \frac{1}{t_1}$ is the arrival rate of jobs from the previous level and
$\mu = \frac{1}{\int_{t=0}^{\infty} t\, p_2(t)\, dt}$ is the rate at which the present level processes data
sets, then the expected queue size for an M/G/1 system can be determined by
the following equation [Che80]:

$$Q = \frac{\lambda}{\mu - \lambda}$$

Clearly, as $\mu$ approaches $\lambda$, the expected queue size gets arbitrarily large. If $t_2$
exists over a finite range, then $t_2(max)$ may be substituted for the $\infty$. This
equation simplifies to:

$$Q = \frac{\bar{t}_2}{t_1 - \bar{t}_2}$$


The expected waiting time in the queue can be determined by the equation
[Ful75]: $W = \bar{t}_2 \times Q$. Table 5.4.4.1 shows the expected queue length (Q), the
expected waiting time in the queue (W), the expected system response time
(SRT), and the expected system throughput (ST) as a function of $t_2$ when the
response time of level 1 is a constant $t_1$. For the calculation of this table, it
was assumed that the data arrived at level 1 no faster than one job per time $t_1$,
i.e., the system could keep up with the incoming data. This is the same
assumption made for the synchronous case.

Table 5.4.4.1.

The expected queue size (Q), expected time spent in a queue (W), the expected
system response time (SRT), and the expected system throughput (ST) for an
asynchronous system with $t_1$ fixed and $t_2$ an arbitrary random variable.

SRT for the asynchronous system is calculated as follows:

$$SRT = t_1 + \bar{t}_2 + W$$
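The queue length, waiting time, and response time for this asynchronous case
can be computed with the minimal sketch below. It assumes a constant level 1
time t1, a mean level 2 time smaller than t1, and the relations Q = mean t2 /
(t1 - mean t2), W = mean t2 x Q, and SRT = t1 + mean t2 + W; the function name
is hypothetical.

```python
def async_metrics(t1, t2_mean):
    """Return (Q, W, SRT) for a two level asynchronous system with fixed t1."""
    if t2_mean >= t1:
        raise ValueError("the queue grows without bound unless t2_mean < t1")
    q = t2_mean / (t1 - t2_mean)   # expected queue length
    w = t2_mean * q                # expected waiting time in the queue
    srt = t1 + t2_mean + w         # expected system response time
    return q, w, srt
```

Calling async_metrics for mean level 2 times that approach t1 shows how quickly
the queue length and the response time grow.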

When  results  of  this  analysis  are  compared  with  the  previous  results,  the 
asynchronous system will yield up to 18% greater throughput. Using
asynchronous  hardware  will  provide  greater  utilization  of  the  hardware. 
Further,  the  asynchronous  system  will  not  need  hardware  to  transfer  data  from 
one  swinging-buffer  to  another,  unlike  the  triple-buffering  scheme. 

For  the  applications  discussed  earlier,  it  would  seem  that  the  asynchronous 
systems  are  “the  way  to  go.”  While,  in  general,  they  are  feasible,  there  are 
specific  cases  where  it  may  not  be  advantageous  to  use  an  asynchronous 
system;  e.g.,  when  response  time  is  critical.  While  an  asynchronous  system  has 
a  higher  average  throughput,  for  specific  data  sets  there  may  be  a  significant 
delay  caused  by  time  spent  in  a  queue.  The  worst-case  response  time  of  an 
asynchronous  system  could  be  greater  than  some  threshold.  In  such  an  event, 
an  asynchronous  system  would  not  be  desirable.  On  the  average,  however, 
asynchronous systems offer higher throughput than their synchronous
counterparts.


5.4.5.  Analysis  of  Systems  Composed  of  Two  Levels  Whose  Response 
Times  Are  Random  Variables 

The previous discussion held $t_1$ to be a fixed entity. Now, consider the case
where $t_1$ can vary. It will be assumed, however, that $t_1$ is Markovian. For
synchronous systems, the analysis must be divided into three parts because of
the three distinct ways that the times for the levels can be related. These
three cases are shown in Fig. 5.4.5.1. (The dashed line represents the range of
processing times for $t_2$ and the solid line represents the range of processing
times for $t_1$.) For the first case, EPT is $t_1$. This is the least complex and least
useful  situation.  Clearly,  if  there  is  no  overlap  between  the  processing  times  of 
the  levels,  one  of  the  levels  is  processing  more  quickly  than  is  needed; 
consequently,  the  stages  of  the  pipeline  are  not  balanced.  Thus,  the  faster  level 
could  be  built  with  slower  and  presumably  less  expensive  hardware. 

Now, consider the case where the time span for $t_1$ overlaps with $t_2$.
Assume that the maximum and minimum possible values for $t_i$ are $t_i(max)$ and
$t_i(min)$. EPT is (recall $pr_i(t)$ is the probability that level i will have response
time t and that $PR_i(t)$ is the probability that level i will respond faster than
time t):

$$\begin{aligned}
EPT = {} & \int_{t=t_2(min)}^{t_1(min)} \bar{t}_1\, pr_2(t)\, dt + \int_{t=t_1(min)}^{t_2(max)} t\, pr_2(t)\, PR_1(t_1 < t)\, dt \\
& + \int_{t=t_1(min)}^{t_2(max)} t\, pr_1(t)\, PR_2(t_2 < t)\, dt + \int_{t=t_2(max)}^{t_1(max)} t\, pr_1(t)\, dt
\end{aligned}$$

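This can be transformed into the following equation:

$$\begin{aligned}
EPT = {} & \int_{t=t_2(min)}^{t_1(min)} \bar{t}_1\, pr_2(t)\, dt + \int_{t=t_1(min)}^{t_2(max)} t\, pr_2(t) \int_{x=t_1(min)}^{t} pr_1(x)\, dx\, dt \\
& + \int_{t=t_1(min)}^{t_2(max)} t\, pr_1(t) \int_{x=t_1(min)}^{t} pr_2(x)\, dx\, dt + \int_{t=t_2(max)}^{t_1(max)} t\, pr_1(t)\, dt
\end{aligned}$$

[Fig. 5.4.5.1 panel titles: Non-overlapping times; Overlapping times]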


Because of the complexity of this integral, its value must be calculated
numerically if the distribution of $pr_1(t)$ or $pr_2(t)$ is Gaussian. For the following
discussion, assume that $pr_1(t)$ and $pr_2(t)$ are both uniform distributions. Recall
that: $pr_i(t) = \frac{1}{t_i(max) - t_i(min)}$, that $t_i(max)$ is $\bar{t}_i(1 + \sqrt{3}C_i)$ and that $t_i(min)$ is
$\bar{t}_i(1 - \sqrt{3}C_i)$. If the standard deviation and mean of level 1 are $\sigma_1$ and $\bar{t}_1$
respectively and the standard deviation and mean of level 2 are $\sigma_2$ and $\bar{t}_2$
respectively, the following equation is the evaluation of the above integral. $C_i$
is defined to be $\frac{\sigma_i}{\bar{t}_i}$ and b is $\frac{\bar{t}_2}{\bar{t}_1}$. Again, $C_i$ can be no greater than .577 if only
positive values of $t_i$ are to be allowed.


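$$\begin{aligned}
EPT = {} & \frac{\bar{t}_1\left[(1 - \sqrt{3}C_1) - (b - b\sqrt{3}C_2)\right]}{2b\sqrt{3}C_2} \\
& + \frac{\bar{t}_1\left[2b^3(1 + \sqrt{3}C_2)^3 + (1 - \sqrt{3}C_1)^3 - 3b^2(1 + \sqrt{3}C_2)^2(1 - \sqrt{3}C_1)\right]}{36C_1C_2b} \\
& + \frac{\bar{t}_1\left[(1 + \sqrt{3}C_1)^2 - (b + bC_2\sqrt{3})^2\right]}{4\sqrt{3}C_1}
\end{aligned}$$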


Table 5.4.5.1 applies this equation to derive the expected response time and
system throughput of a double- and triple-buffered synchronous system as a
function of b, $C_1$, and $C_2$.

The  final  case,  where  one  interval  is  contained  in  the  other,  is  similar  to 
the  previous  case.  EPT  can  be  defined  as  follows  (when  the  possible  response 
times  of  level  2  are  a  subset  of  those  for  level  1): 

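$$\begin{aligned}
EPT = {} & \int_{t=t_1(min)}^{t_2(min)} \bar{t}_2\, pr_1(t)\, dt + \int_{t=t_2(min)}^{t_2(max)} t\, pr_2(t) \int_{x=t_2(min)}^{t} pr_1(x)\, dx\, dt \\
& + \int_{t=t_2(min)}^{t_2(max)} t\, pr_1(t) \int_{x=t_2(min)}^{t} pr_2(x)\, dx\, dt + \int_{t=t_2(max)}^{t_1(max)} t\, pr_1(t)\, dt
\end{aligned}$$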

The  analysis  for  this  case  is  similar  to  the  previous  analysis  and  it  is  omitted 
here  for  brevity. 
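Where the integrals are awkward to evaluate in closed form, the expected
processing time of a synchronous pair of levels can also be estimated by
sampling. The sketch below is not the closed-form evaluation; it simply
assumes that, in synchrony, each data set costs the slower of the two level
times, and averages that maximum over random draws. The function name, the
sampler arguments, and the default sample count are hypothetical.

```python
import random

def ept_monte_carlo(sample_t1, sample_t2, n=100_000, seed=0):
    """Estimate EPT as the mean of max(t1, t2) over n sampled data sets.

    sample_t1 and sample_t2 are callables that take a random.Random instance
    and return one processing time for level 1 and level 2 respectively.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += max(sample_t1(rng), sample_t2(rng))
    return total / n
```

For overlapping uniform levels one might call, for example,
ept_monte_carlo(lambda r: r.uniform(0.8, 1.2), lambda r: r.uniform(0.5, 1.0));
Gaussian levels can be handled the same way by passing samplers built on
r.gauss.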

Now  consider  an  asynchronous  system  with  the  same  two  levels.  For  this 
analysis,  level  1  is  assumed  to  be  Markovian.  Any  probability  function  that 
can  describe  the  response  time  of  level  2  will  be  allowed.  If  level  1  runs 
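continuously then application of the Pollaczek-Khinchine formula predicts the
expected queue length to be [Ful75]:

$$Q = \frac{\frac{2}{\bar{t}_1}\left(\frac{1}{\bar{t}_2} - \frac{1}{\bar{t}_1}\right) + \frac{C^2 + 1}{\bar{t}_1^2}}{\frac{2}{\bar{t}_2}\left(\frac{1}{\bar{t}_2} - \frac{1}{\bar{t}_1}\right)}$$

Here, $C = \frac{\sigma_2}{\bar{t}_2}$, where $\sigma_2$ is the standard deviation of the processing time for
level 2. Multiplying Q by $\bar{t}_2$ yields the expected wait time in the queue. Thus,
for an asynchronous system the expected SRT is:

$$SRT = \bar{t}_1 + \left[\bar{t}_2 + \bar{t}_2\, \frac{\frac{2}{\bar{t}_1}\left(\frac{1}{\bar{t}_2} - \frac{1}{\bar{t}_1}\right) + \frac{C^2 + 1}{\bar{t}_1^2}}{\frac{2}{\bar{t}_2}\left(\frac{1}{\bar{t}_2} - \frac{1}{\bar{t}_1}\right)}\right]$$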


The  expected  throughput  of  the  asynchronous  system  is: 


$$ST = \frac{1}{\max(\bar{t}_1, \bar{t}_2)}$$


The memory requirements of the double-buffered system are $2N_L$ data sets
($N_L$ is the number of levels in the system), while the memory requirements of
the triple-buffered system are $3N_L$ data sets. Finally, allowing a double input
buffer for the first level of the asynchronous system, its memory requirements
are $2 + (N_L - 1)Q$ data sets.
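The response time and memory figures for the three architectures can be
tabulated with a sketch such as the following. It assumes the
Pollaczek-Khinchine queue length with arrival rate 1/t1 and service rate 1/t2
(both taken as mean times), an expected wait of t2 times Q, and the memory
counts stated above; the function names and dictionary keys are hypothetical.

```python
def async_srt(t1_mean, t2_mean, c):
    """Return (SRT, Q) for a two level asynchronous system."""
    if t2_mean >= t1_mean:
        raise ValueError("the queue grows without bound unless t2_mean < t1_mean")
    lam, mu = 1.0 / t1_mean, 1.0 / t2_mean   # arrival and service rates
    q = (2.0 * lam * (mu - lam) + lam ** 2 * (c ** 2 + 1.0)) / (2.0 * mu * (mu - lam))
    wait = t2_mean * q                        # expected wait time in the queue
    return t1_mean + t2_mean + wait, q

def memory_requirements(n_levels, q):
    """Buffer memory, in data sets, for each of the three architectures."""
    return {
        "double_buffered": 2 * n_levels,
        "triple_buffered": 3 * n_levels,
        "asynchronous": 2 + (n_levels - 1) * q,
    }
```

Comparing the "asynchronous" entry with the other two for a given Q shows when
the queued design needs less memory than fixed buffering.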

To relate the response time and throughput of the various systems, there
are three cases that can arise. $\bar{t}_2$ can be less than, equal to, or greater than $\bar{t}_1$.
Applying the last case to the asynchronous system, it is assumed that a new
data set is arriving every $\bar{t}_1$ seconds, thus the queue for level 2 would be
required to grow without bound -- clearly not feasible. Synchronous systems
would lose data sets in this event. A similar situation arises for the cases where
$\bar{t}_2 = \bar{t}_1$.

The following analysis will show how the three architectures behave when
$\bar{t}_2 < \bar{t}_1$. The distributions used to describe $t_2$ will be the normal distribution
(Gaussian) and the exponential distribution. Mean values of $t_2$ considered will
be $0.5\bar{t}_1$ and $0.75\bar{t}_1$.

Figs. 5.4.5.2, 5.4.5.3, and 5.4.5.4 show the effects of $\bar{t}_2$ and C on the
expected queue length for the second level of the asynchronous system. Fig.
5.4.5.5 shows the effects of C on the expected queue length of an asynchronous
system. C and $\bar{t}_2$ affect the time that a job spends waiting in a queue. For
systems in which the ratio of the standard deviation to the mean (C) is small,
the expected response time of level two does not have to be as fast as would be
required by larger values of C. Note: values of C larger than one are
meaningless because such a condition implies that negative processing times are
possible for level two. Table 5.4.5.2 shows the response time and throughput of
an asynchronous system where each data set is processed only once, i.e., where
data sets are not fed back for more processing (W is the expected wait time in
the queue, and ST is the expected system throughput):

This table is true when the service distribution for the second level
is a general distribution. For an exponential distribution, C is defined to be
one.  The  results  printed  in  this  table  merit  discussion  because  they  are  not 
intuitive.  Consider  the  following  diagram: 


Table  5.4.5.2. 


It would seem reasonable that no queue is needed between these two levels,
since on the average the second level completes its processing faster than the
first level; however, consider the case where level two requires $1.25 \times t_1$ to
complete two adjacent data sets and the normal $0.95 \times t_1$ to complete the next
four data sets, and $0.65 \times t_1$ to complete the final two data sets. The average
queue length is approximately 1. With a wide variation of response times and
many data sets, this can cause the expected queue length to grow.

If the first level completes processing on several consecutive data sets
faster than $\bar{t}_1$ then the data sets need to be queued for the second level. As $\bar{t}_2$
approaches $\bar{t}_1$, the likelihood of the first level producing large numbers of jobs
faster than the second level can handle them increases.
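The way a run of slow data sets builds a queue, even when level two is faster
on average, can be traced with the small sketch below. It assumes that level
one hands over one data set every t1 time units and that level two serves them
in arrival order; the function name is hypothetical, and the service times in
the usage note are the ones quoted in the example above.

```python
def trace_queue(t1, service_times):
    """Return the time each data set waits before level two starts on it."""
    waits = []
    free_at = 0.0                        # time at which level two becomes idle
    for k, service in enumerate(service_times):
        arrival = k * t1                 # level one delivers every t1
        start = max(arrival, free_at)    # wait if level two is still busy
        waits.append(start - arrival)
        free_at = start + service
    return waits
```

Calling trace_queue(1.0, [1.25, 1.25, 0.95, 0.95, 0.95, 0.95, 0.65, 0.65])
shows that the delay introduced by the two slow data sets is only worked off
once the faster data sets arrive.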


5.4.6.  Analysis  of  Q 


When  considering  the  expected  queue  size,  the  expected  wait  time  in  the 
queue,  and  the  system  response  time  of  an  asynchronous  system,  the  previous 
discussion  assumed  that  there  would  be  enough  buffer  memory  to  hold  all  the 
data  sets.  The  expected  queue  size  is  the  probabilistic  term  for  the  average 
queue  size.  Thus,  if  a  buffer  memory  size  equal  to  the  average  queue  size  is 
used, the probability of overflow is 0.5. At this point, no data sets can be
taken. Such an event at level I will cause processing to stop at level I-1 when
level I-1 attempts to send its results to level I. Rapidly, this will cause the
input queue for level I-1 to fill, halting level I-1. Thus, all levels in the system
will process data sets at the rate of level I, which is the slowest level in the
system. This effectively slows the asynchronous system's processing rate down
to the rate of a synchronous system.

It  should  be  remembered  that  where  an  asynchronous  system  will  halt,  the 
synchronous  system  will  issue  a  signal  that  it  is  not  ready.  Thus,  an 
asynchronous  system  will  be  less  likely  to  attempt  to  stop  the  stream  of  input 
data.  Because  the  asynchronous  system  queues  its  jobs,  the  response  time  for  a 
particular  job  can  become  large;  however,  this  is  not  taken  into  account  with 
the  synchronous  system.  Response  time  only  refers  to  jobs  that  are  in  the 
system,  thus,  statistically  the  SRT  is  biased  in  favor  of  the  synchronous 
system. 

The  probability  of  halting  levels  because  of  queue  overflow  in  subsequent 
levels  can  be  greatly  reduced  by  allowing  an  appropriate  queue  size  for  the 
level-level  queues.  The  next  question  becomes:  “What  is  an  appropriate  queue 
size  for  a  level-level  queue?”  This  question  can  be  addressed  probabilistically. 


From the definition of Q, P(queue length > Q) = 0.5; this can be expanded to:
P(queue length > kQ) = $0.5^k$. Depending on the margin of safety desired, the
buffer size can be chosen appropriately. Table 5.4.6.1 shows the probability of
the  queue  overflowing  versus  the  size  of  the  queue.  This  table  shows  the 
expected  probability  of  overflow  for  a  queue  that  is  a  multiple  of  Q.  It  does 
not  take  account  of  the  processing  requirements  of  the  data  sets,  i.e.,  how  likely 
is  it  that  level  I  will  complete  its  processing  slower  than  the  rest  of  the  system 
on  this  many  data  sets.  Here,  the  underlying  assumption  is  that  level  I  is  as 
likely  to  be  slow  after  one  job  as  after  100.  Such  being  the  case,  this  table 
represents  a  ceiling  on  the  probability  of  overflow. 
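Choosing a buffer size from a target overflow probability can be sketched as
follows. The sketch assumes the relation P(queue length > kQ) = 0.5^k discussed
above and restricts k to whole multiples of Q; the function names are
hypothetical.

```python
def overflow_probability(k):
    """Ceiling on the probability that a queue of size k*Q overflows."""
    return 0.5 ** k

def buffer_multiple_for(target):
    """Smallest whole multiple k of Q whose overflow ceiling is <= target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    k = 1
    while overflow_probability(k) > target:
        k += 1
    return k
```

For a one percent margin, buffer_multiple_for(0.01) returns the multiple of Q
that the buffer would need under this assumption.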

When  the  memory  required  for  Q  is  large,  the  cost  of  overflow  protection 
can  be  significant,  so  a  tighter  limit  may  be  required.  If  the  queue  for  level  I 
overflows  then  the  throughput  of  the  system  will  drop  to  that  of  a  synchronous 
system.  Further,  it  will  have  the  response  time  of  an  asynchronous  system  with 
its buffers full for levels 1 through I-1. For a system with l levels, the
response time can be calculated as follows:

$$SRT(\text{overflow}) = \sum_{i=1}^{l} EPT_i + \sum_{i=1}^{I} Q_i\, EPT_I + \sum_{i=I+1}^{l} Q_i\, EPT_i$$

This can represent a significant amount in the case where a large amount of
buffering is allocated (again it should be noted that a synchronous system
would halt its input stream). When only k of the previous input buffers fill,
the SRT can be shown to be:

$$SRT(k \text{ levels full}) = \sum_{i=1}^{l} EPT_i + \sum_{i=I-k}^{I} Q_i\, EPT_I + \sum_{i=I+1}^{l} Q_i\, EPT_i$$

The  previous  tables  of  asynchronous  system  values,  thus  need  to  be  weighted 
according  to  the  probability  of  overflow.  This  is  done  as  follows: 


$$\begin{aligned}
SRT(\text{w/overflow}) = {} & \left(1 - \sum_{i=1}^{l} P_i(\text{overflow})\right) SRT \\
& + \sum_{i=1}^{l} \sum_{j=1}^{i} p_i(\text{overflow affects } j \text{ levels}) \times (\text{number of jobs queued}) \times t_i(max)
\end{aligned}$$

Table 5.4.6.1

Probability of overflow versus Q

    P(overflow)     Q x
    0.50000          1.0
    0.25000          2.0
    0.12500          3.0
    0.06250          4.0
    0.03125          5.0
    0.015625         6.0
    0.00781          7.0
    0.00390          8.0
    0.00195          9.0
    0.00098         10.0


In the case where the overflow causes level 1 to stop the unit loading the
data, lost data sets cannot be counted or accounted for. For calculation,
this can be avoided by assuming that level 1 has an arbitrary queue size; thus
no jobs are lost, only queued.

The  system  throughput  (ST)  can  be  calculated  by  a  weighted  summation 
of  the  synchronous  and  asynchronous  cases.  Define  ovf  to  be  the  total 
probability  that  an  overflow  will  occur.  Then  1-ovf  will  be  the  probability  that 
an  asynchronous  system  will  run  asynchronously.  For  the  asynchronous  cases 
presented,  this  becomes: 

$$ovf = ovf_1 + ovf_2$$

($ovf_i$ is the probability that level i will overflow its input buffer.) Using the
definition of ovf, the weighted system throughput becomes:

ST(with  overflow)  =  (1  -  ovf)ST(asynchronous)  +  (ovf)ST(synchronous) 
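The weighting can be written as a one-line helper, sketched here under the
assumption that the asynchronous and synchronous throughputs have already been
computed for the same hardware. The function name and the numbers in the usage
note are illustrative only.

```python
def weighted_throughput(ovf, st_async, st_sync):
    """Throughput of an asynchronous system that overflows with probability ovf."""
    if not 0.0 <= ovf <= 1.0:
        raise ValueError("ovf must be a probability")
    # With probability 1 - ovf the system runs asynchronously; otherwise it
    # falls back to the rate of the equivalent synchronous system.
    return (1.0 - ovf) * st_async + ovf * st_sync
```

For instance, with throughputs expressed in units of 1/t1,
weighted_throughput(0.25, 1.0, 0.89) blends an asynchronous rate of 1.0 with a
synchronous rate of 0.89.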


Thus, for a system where the response time for the first level is a constant $t_1$
and the response time for the second level is a Gaussian distribution function,
if C=0.75 and b=0.75, and the total probability of overflow is 0.25, ST is:

$$ST = 0.75 \times \frac{1.0}{t_1} + 0.25 \times \frac{0.89}{t_1} = \frac{0.97}{t_1}$$

The overflow of a buffer does not have a significant effect on an
asynchronous system's throughput. It affects the SRT. A synchronous system
(built with similar hardware) would use some form of flow control to inform the
device supplying the inbound data to halt. An asynchronous system would
have to do the same only when all the internal buffers were filled. For
real-time systems, this is clearly not desirable because data would be lost. An
asynchronous system would be less likely to stop the incoming flow of data
than a synchronous system because of the internal queuing.

5.4.7.  Double-buffering  Versus  Triple-buffering  —  An  Analysis 

In  general,  a  synchronous  system  that  is  double-buffered  will  have  a  faster 
SRT  than  a  system  that  is  triple-buffered.  Both  systems  will  have  the  same 
ST.  Thus,  it  is  reasonable  to  question  the  need  for  a  triple-buffered  system. 
The  following  discussion  will  consider  this  question. 

Assume  that  each  level  of  the  proposed  system  is  physically  remote  from 
the other levels. The time for a level to write data into the double-buffer would
be $\delta t \times dss$, where dss is the data set size. Here, the processor would
wait for dss responses from the buffer memory. This would adversely affect the
processing speed of the system because the data transfer time is a portion of
the processing time. If a triple-buffered system were to be used here, the total
data transfer time would have to exceed the maximum processing time of the
levels involved before it could affect ST. Where levels are not
geographically remote, $\delta t$ is small and has little effect. Thus, triple-buffering is
not needed.


A  combination  of  the  two  buffering  strategies  is  possible  where  some  levels 
are  close  and  others  remote.  An  example  of  this  can  be  seen  where  some  levels 
share a given rack, while others are in another rack. Between levels in a given
rack, data transfers are quick, so double-buffering can be applied. Between
racks, data transfers may be slow, so triple-buffering may be applied. This
technique offers the advantages of a triple-buffered system without unnecessary
delays  where  data  does  not  need  to  travel  a  great  distance.  Further,  the 
throughput  of  the  system  is  not  degraded  when  the  data  must  travel  to  a 
remote  location. 

Thus,  where  transmission  time  between  levels  is  significant,  triple-buffering 
is  a  useful  tool  because  it  overlaps  the  data  transmission  time  with  the  data 
processing  times.  When  transmission  time  is  not  a  significant  problem,  the 
triple-buffered  scheme  will  use  extra  memory  and  hardware.  Further,  use  of  a 
triple-buffered  system  will  increase  the  system  response  time  by  the  response 
times  of  all  levels  using  triple-input-buffers.  This  can  represent  a  significant 
increase  in  system  response  time  over  the  double-buffered  approach. 

5.4.8.  Synchronous  Systems  Versus  Asynchronous  Systems 

Where  there  is  no  ceiling  on  the  response  time  of  a  system,  such  as  in  a 
non-real-time  environment,  asynchronous  systems  offer  potentially  greater 
throughput  than  synchronous  systems  with  the  same  processing  hardware. 
This is because, in a synchronous system, levels that complete their processing
sooner than other levels must wait for the slower levels to "catch up." In a
real-time environment, synchronous and asynchronous systems built with
similar hardware will yield interesting results. If the hardware is built so that
the synchronous system can keep up with the incoming data, the asynchronous
system will be idle. The response time on the asynchronous system will be
slightly  faster  than  on  its  synchronous  counterpart,  because  the  entire 
synchronous  system  waits  on  its  slowest  level.  The  entire  asynchronous  system 
will  not  wait  on  the  slowest  level  (unless  the  pipeline  is  full),  so  the  effect  of  the 
slowest  level  is  more  limited  in  the  asynchronous  case. 

If two independently designed systems, one synchronous, the other
asynchronous, are built to process the same task, there will be little difference
in the price. The synchronous system will require more expensive processing
hardware, while the asynchronous system will potentially require extra
memory. The key issue is that if, from a given database, a synchronous system
cannot be built to execute a given task within some time constraints, an
asynchronous system could be used to raise the ST and decrease the SRT by a
significant amount. For non-real-time systems, asynchronous systems can be
built  for  less  money  than  the  synchronous  systems  because  large  amounts  of 
buffer  memory  are  not  needed  --  thus  the  asynchronous  system  would  be  the 
better  choice.  For  real-time  systems,  the  asynchronous  system  has  a  variable 
response  time,  depending  on  the  loading  of  the  pipeline,  thus  if  a  variable 
response  time  is  undesirable,  a  synchronous  system  should  be  used. 


5.5.  System  Simulation  —  Results 

To either prove or disprove the theoretical results presented in the
previous sections, a simulator (shown in Appendix 1) was developed. This
section presents and analyzes the results of the simulation.

The numbers presented in Table 5.5.1 represent the simulated performance
of a two level system, where the response time of the first level is fixed and the
response time of the second level is a uniform random variable. Both
synchronous and asynchronous statistics are shown. When the synchronous
statistics are compared with the statistics shown in Table 5.4.3.2, the theory
predicts the actual results with a maximum error of 0.03. "b (actual)" is
the actual ratio of t2 to t1. Due to some inconsistencies in the random number
generation, the expected ratio, as defined in the second line of the results, is
not the actual ratio. Results for the case when level 2 is a Gaussian random
variable are shown in Table 5.5.2. When compared with the results shown in
Table 5.4.3.1 (C for this table is approximately 0.25), the simulated results
differ from the theoretical results by no more than 1 percent. Thus, for the
simulated data sets, the theory presented is an accurate representation of the
actual results.
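
The synchronous columns can be checked by hand, because the simulator listing draws the level 2 time from only sixteen equally likely values. The sketch below is an illustration, not the derivation behind Table 5.4.3.2: the function names are hypothetical, and it assumes the values used in the listing (t1 = 8/16, t2 = b k/16 for k = 1..16) and the listing's definitions SRT(DB) = 2 E[max(t1, t2)] / t1 and ST = t1 / E[max(t1, t2)].

```c
#include <assert.h>

#define SLOTS 16    /* the listing draws an integer k in 1..16 */

/* Exact mean synchronous cycle time E[max(t1, t2)] for the assumed
 * discrete distribution t2 = b * k / 16, k = 1..16, with t1 = 8/16. */
double mean_sync_cycle(double b)
{
    const double t1 = 8.0 / SLOTS;
    double sum = 0.0;
    assert(b > 0.0);
    for (int k = 1; k <= SLOTS; k++) {
        double t2 = b * (double)k / SLOTS;
        sum += (t2 > t1) ? t2 : t1;     /* the slower level sets the cycle */
    }
    return sum / SLOTS;
}

/* Double-buffered synchronous response time, in units of t1. */
double sync_srt_db(double b)
{
    return 2.0 * mean_sync_cycle(b) / (8.0 / SLOTS);
}

/* Synchronous throughput relative to 1/t1. */
double sync_st(double b)
{
    return (8.0 / SLOTS) / mean_sync_cycle(b);
}
```

For b = 1.00 the mean cycle is 0.640625, giving SRT(DB) = 2.5625 and ST of about 0.78, in line with the first row of the simulated synchronous statistics.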

When comparing the results for the asynchronous system in Table 5.5.2 to
the proposed results shown in Table 5.4.4.1, it is again necessary to take "b
(actual)" into account. For uniform data, this comparison yields the following
results. Taking the ratios (based on the actual value
of b) of Qmax and Q to the Q predicted in Table 5.4.4.1, the first ratio falls in
the range 1.45 ± 0.155. The second ratio falls in the range 0.14 ± 0.05.
Comparison of these results shows that the maximum queue size is
approximately ten times the average queue size (based on 200000 test runs).


Table 5.5.1.

   Parameters        Synchronous System             Asynchronous System
 t2/t1  b (act)   SRT(DB)  SRT(TB)  ST(x t1)      Qmax        Q     SRT/t1

 1.00    1.06      2.56     5.12     0.78        11596     5764      6113
 0.95    1.02      2.48     4.96     0.81         2230      976       988
 0.90    0.96      2.41     4.82     0.83           39     3.49      5.04
 0.85    0.90      2.34     4.68     0.86           16     1.42      2.99
 0.80    0.84      2.26     4.52     0.88            7     0.73      2.33
 0.75    0.80      2.20     4.40     0.91            6     0.54      2.08
 0.70    0.74      2.14     4.28     0.93            4     0.29      1.89
 0.65    0.70      2.08     4.16     0.96            2     0.17      1.76


Table 5.5.2.

Gaussian system response times.

  Parameters        Synchronous System             Asynchronous System
 Samples  t2/t1   SRT(DB)  SRT(TB)  ST(x t1)      Qmax        Q     SRT/t1

 200000   1.00     2.10     4.20     0.95          520   253.73    255.71
 200000   0.95     2.08     4.16     0.96          201    36.45     38.03
 200000   0.90     2.03     4.06     0.99            1     0.15      1.92

Further, with $9% accuracy, 1.45 times the predicted queue size yields the
maximum queue size. Thus, for response times that can be modeled as a
uniform distribution, 1.45 times the queue size predicted by the application of
the Pollaczek-Khinchine formula is an accurate model of the maximum queue needed
for the inter-level buffer of a system where a level with a fixed execution time
is feeding data to a level with an execution time that can be modeled by a
uniform random variable.
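
A buffer estimate of this kind can be sketched as follows. This is an illustration only: the exact form used for Table 5.4.4.1 is not reproduced here, so the sketch assumes the standard Pollaczek-Khinchine mean queue length for an M/G/1 queue, Lq = rho^2 (1 + C^2) / (2 (1 - rho)), with rho taken as the utilization of the second level and C as the coefficient of variation of its execution time. The function names are hypothetical, and the formula assumes Poisson arrivals, whereas the simulated first level has a fixed execution time.

```c
#include <assert.h>

/* Mean number of jobs waiting in an M/G/1 queue (Pollaczek-Khinchine).
 * rho: utilization of level 2 (must be below 1 for a stable queue).
 * c:   coefficient of variation of the level 2 execution time. */
double pk_mean_queue(double rho, double c)
{
    assert(rho >= 0.0 && rho < 1.0);
    assert(c >= 0.0);
    return rho * rho * (1.0 + c * c) / (2.0 * (1.0 - rho));
}

/* Inter-level buffer estimate using the factor of 1.45 reported above for
 * uniformly distributed level 2 execution times. */
double buffer_estimate(double rho, double c)
{
    return 1.45 * pk_mean_queue(rho, c);
}
```

For example, pk_mean_queue(0.5, 1.0) evaluates to 0.5, so the corresponding buffer estimate is 0.725 jobs. Rounding the estimate up to a whole number of buffer slots is left to the designer.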

If the execution time of the second level in a system can be modeled by a
Gaussian random variable, the Pollaczek-Khinchine rule does not accurately
predict the expected queue size of the system when the average response time
of the second level is more than 0.95 times the response time of the first level.
This is shown by comparing the results in Table 5.5.2 with those in Table
5.4.4.1.

The statistics presented in Table 5.4.5.1 for a synchronous system are 30
percent lower than the results achieved through simulation shown in Table 5.5.3.
For the simulation, C1 = 0.570 and C2 = 0.577. The results of the simulation
show that a synchronous system behaves slightly worse than the theory predicts.
Finally, the simulation results presented in Table 5.5.4 (C = 0.58) show that the
expected theoretical queue sizes presented in Table 5.4.5.2 are 30 percent
greater than the actual average queue size.

The results of the simulation show that synchronous systems behave worse
than the theory predicts and that asynchronous systems behave better than the
theory predicts. To achieve the same throughput, synchronous systems require
hardware that is twice as fast as asynchronous systems performing the same
task. Further, for a two level system, when the expected response time of the
second level is 75 percent of the first level, the asynchronous system will


Table 5.5.3.

 Samples   t2/t1   SRT(DB)   SRT(TB)   ST(/t1)

 1E+06     1.00     2.71      5.42      0.74
 1E+06     0.95     2.64      5.28      0.76
 1E+06     0.90     2.60      5.20      0.77
 1E+06     0.85     2.53      5.06      0.79
 1E+06     0.80     2.46      4.92      0.81
 1E+06     0.75     2.42      4.84      0.83
 1E+06     0.70     2.37      4.74      0.85
 1E+06     0.65     2.32      4.64      0.86


Table 5.5.4.

Performance statistics for an asynchronous system whose levels can be modeled
by uniform random variables.

 Samples   t2/t1   Qmax      Q

 2E+05     1.00     177    62.8     1.00
 2E+05     0.95      30    6.01     1.00
 2E+05     0.90      29    2.97     1.00
 2E+05     0.85      20    1.76     1.00
 2E+05     0.80      13    1.22     1.00
 2E+05     0.75      12    0.88     1.00
 2E+05     0.70      10    0.67     1.00

generate results more quickly. This is the worst case and occurs when the
distribution of the second level can be modeled by a uniform distribution.
When the second level can be modeled by a Gaussian distribution, the results
are more pronounced (the cross-over point occurs when the average response
time of the second level is 85 percent of the first level).
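
The cross-over point for the uniform case can be read directly from the simulated statistics. The sketch below is an illustration with hypothetical names: the rows are transcribed from Table 5.5.1, and the comparison is assumed to be between the asynchronous SRT and the double-buffered synchronous SRT(DB), both in units of t1.

```c
#include <assert.h>
#include <stddef.h>

struct row {
    double ratio;        /* t2/t1 */
    double sync_srt_db;  /* synchronous SRT(DB), in units of t1 */
    double async_srt;    /* asynchronous SRT, in units of t1 */
};

/* Values transcribed from Table 5.5.1 (uniform level 2 times). */
static const struct row table_5_5_1[] = {
    {1.00, 2.56, 6113.0}, {0.95, 2.48, 988.0}, {0.90, 2.41, 5.04},
    {0.85, 2.34, 2.99},   {0.80, 2.26, 2.33},  {0.75, 2.20, 2.08},
    {0.70, 2.14, 1.89},   {0.65, 2.08, 1.76},
};

/* Returns the largest ratio t2/t1 at which the asynchronous system responds
 * faster than the synchronous one, or -1.0 if no such row exists. */
double crossover_ratio(const struct row *rows, size_t n)
{
    double best = -1.0;
    for (size_t i = 0; i < n; i++)
        if (rows[i].async_srt < rows[i].sync_srt_db && rows[i].ratio > best)
            best = rows[i].ratio;
    return best;
}
```

Applied to the eight rows above, crossover_ratio returns 0.75, which agrees with the 75 percent figure quoted for the uniform case.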

5.6.  Conclusions

Theory and background information were presented to relate the
performance of both asynchronous and synchronous systems. The theory
predicted that synchronous systems would have 84 percent of the throughput
of asynchronous systems. Through simulation, this figure was shown to be up
to 16% high. Application of the theory to asynchronous systems showed that
for certain hardware configurations, asynchronous systems had both greater
throughput and lower response time. The key disadvantage to asynchronous
systems is clearly that the response time can vary by a large amount. For
real-time systems, this fact is significant enough that asynchronous systems


SIMULATOR LISTINGS

System  Simulator  Level  1  Response  Time  Fixed  and 
Level  2  Response  Time  Uniform  Random  Variable 


#include <stdio.h>
#include <stdlib.h>     /* srandom(), random(), exit() */
#include <unistd.h>     /* getpid() */

/*
 * This simulator was written for a Gould Powernode 9080 running UNIX
 * 4.2 BSD. It simulates the execution of both a synchronous and an
 * asynchronous two level system, keeping track of the level-to-level
 * buffer sizes and the response times of the second level. For this
 * simulator, it is assumed that the distribution describing the response
 * time of the second level is the uniform distribution function, and
 * that the first level has a constant response time. It is run on a
 * total of 200,000 data sets per simulation statistic, requiring
 * more than 1.0Mb storage and more than 2 hours to complete.
 */

#define EVENTS  200000          /* Events per simulation */
#define FEVENTS 200000.0        /* floating point representation */
FILE  *f1;                      /* file pointers */
int   density[16];              /* for density function */
                                /* Gould Firebreather 9080 Compiler */
                                /* needs the NOBASE because */
                                /* it needs to use a different */
                                /* addressing mode to handle large */
/* NOBASE */                    /* arrays. */
float timelist[100+EVENTS];     /* list of event times */
int   pqueuesize, mqueuesize;   /* present and maximum queue sizes */
int   rqueuesize;               /* total jobs stored in queues */
float idletime;                 /* idletime for second level */
float r1sptime;                 /* response time for first level */
float resptime;                 /* response time for second level */
float rrsptime;                 /* cumulative time for second level */
float sysresptime;              /* system response time */
float synchresptime;            /* synchronous system response time */
float pcnt[] = { 1.00,          /* ratio of */
                 0.95, 0.90,    /* response time of level 2 / */
                 0.85, 0.80,    /* response time of level 1 */
                 0.75, 0.70,
                 0.65, 0.60,
                 0.55, 0.50,
                 0.0 };         /* 0.0 ends the list (loop test in main) */

void statqueue(int pos, float prestime, float proctime);

int main(void)
{   float ktime;                /* time keeper */
    float rtime;                /* response time keeper */
    int   eventtime;            /* event duration */
    int   i;                    /* event counter */
    int   j;                    /* index into pcnt[] */

    f1 = fopen("simulation.results", "w");      /* open files */
    srandom(getpid());          /* initialise random number generator */
    if (f1 == NULL)
        exit(1);
    for (j = 0; pcnt[j] > 0.0; j++)
    {   pqueuesize = 0;         /* initialise all statistical keepers */
        mqueuesize = 0;
        rqueuesize = 0;
        idletime = 0.0;
        r1sptime = 0.0;
        resptime = 0.0;
        rrsptime = 0.0;
        sysresptime = 0.0;
        for (i = 0; i < 16; i++)
            density[i] = 0;
        ktime = 8.0/16.0;
        for (i = 0; i < 100+EVENTS; i++)        /* generate level 1 event list */
        {   timelist[i] = ktime;
            ktime += 0.5;
        }
        r1sptime = ktime;
        synchresptime = 0.0;
        ktime = 0.0;
        rrsptime = 0.0;
        for (i = 0; i < EVENTS; i++)
        {   eventtime = 017 & random();         /* random number between */
                                                /* 0 and 15 */
            density[eventtime]++;
            eventtime += 1;
            rtime = pcnt[j]/16.0*(float) eventtime;
            rrsptime += rtime;
            if (rtime < (8.0/16.0))             /* synchronous cycle is set */
                synchresptime += 8.0/16.0;      /* by the slower level */
            else
                synchresptime += rtime;
            statqueue(i+1, ktime, rtime);       /* queue size */
            if (ktime < timelist[i])
            {   idletime += timelist[i] - ktime;
                ktime = timelist[i] + rtime;
            }
            else
                ktime += rtime;
            resptime += ktime - timelist[i];
            if (i > 1)
            {   sysresptime += rtime + timelist[i] - timelist[i-1];
            }
        }
        fprintf(f1, "\f\n\n\n System simulation\n\n");
        fprintf(f1, " Sample set size: %6d\n", EVENTS);
        fprintf(f1, " Processing time of level 2 (times level 1): %6.2f\n",
                pcnt[j]);
        fprintf(f1, " Average processing time (level 1): %6.2f\n",
                r1sptime/(float)(100+EVENTS));
        fprintf(f1, " Average processing time (level 2): %6.2f\n",
                rrsptime/FEVENTS);
        fprintf(f1, "\n Asynchronous system statistics\n");
        fprintf(f1, " Average size of level-level queue: %6.2f\n",
                (float) rqueuesize/FEVENTS);
        fprintf(f1, " Maximum size of level-level queue: %6d\n",
                mqueuesize);
        fprintf(f1, " Average response time (level 2): %6.2f\n",
                resptime/FEVENTS);
        fprintf(f1, " Average system response time (times t1): %6.2f\n",
                ((r1sptime/(FEVENTS+100.0))+(resptime/FEVENTS))/
                (r1sptime/(float)(100+EVENTS)));
        fprintf(f1, " Approximate Percent Idle time (level 2) %6.2f\n",
                100.0*idletime/timelist[EVENTS]);
        fprintf(f1, "\n Synchronous system statistics\n");
        fprintf(f1, " Synchronous SRT(DB) (x t1): %6.2f\n",
                (2*synchresptime/FEVENTS)/
                (r1sptime/(float)(100+EVENTS)));
        fprintf(f1, " Synchronous SRT(TB) (x t1): %6.2f\n",
                (3*synchresptime/FEVENTS)/
                (r1sptime/(float)(100+EVENTS)));
        fprintf(f1, " Synchronous ST(/ t1): %6.2f\n",
                1.0/((synchresptime/FEVENTS)/
                (r1sptime/(float)(100+EVENTS))));
        fprintf(f1, "\n\n Distribution of level 2 response times:\n");
        for (i = 0; i < 8; i++)
            fprintf(f1, "%10d - %6d %2d - %6d\n",
                    i+1, density[i], i+9, density[i+8]);
        fflush(f1);
    }
    return 0;
}

/* Count the jobs, from position pos onward, that arrive before the job */
/* started at prestime finishes; update the queue statistics.           */
void statqueue(int pos, float prestime, float proctime)
{   int j;

    pqueuesize = 0;
    for (j = pos; ((j < (100+EVENTS)) && (timelist[j] < (prestime+proctime))); j++)
        pqueuesize++;
    if (pqueuesize > mqueuesize)
        mqueuesize = pqueuesize;
    rqueuesize += pqueuesize;
}
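
The queue measurement made by statqueue can be isolated in a small, self-contained sketch. This is an illustration rather than part of the original listing: the function name and signature are hypothetical, and the arrival times are assumed to be sorted in increasing order, as they are in the event list above.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the jobs, starting at index pos, that arrive strictly before the
 * job currently in service completes at time done. With sorted arrival
 * times the scan can stop at the first job that arrives too late. */
size_t jobs_waiting(const double *arrival, size_t n, size_t pos, double done)
{
    size_t count = 0;
    for (size_t j = pos; j < n && arrival[j] < done; j++)
        count++;
    return count;
}
```

With arrivals at 0.5, 1.0, 1.5, 2.0 and 2.5, a job that completes at time 1.6 leaves two later jobs waiting (those arriving at 1.0 and 1.5) when the scan starts at index 1. The maximum and the running total of this count give the Qmax and Q statistics.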
System Simulator

Level 1 Fixed and Level 2 Gaussian Random Variable


#include <stdio.h>
#include <stdlib.h>     /* srandom(), random(), exit() */
#include <unistd.h>     /* getpid() */

/*
 * This simulator was written for a Gould Powernode 9080 running UNIX
 * 4.2 BSD. It simulates the execution of both a synchronous and an
 * asynchronous two level system, keeping track of the level-to-level
 * buffer sizes and the response times of the second level. For this
 * simulator, it is assumed that the distribution describing the response
 * time of the second level is the Gaussian distribution function, and
 * that the first level has a constant response time. It is run on a
 * total of 200,000 data sets per simulation statistic, requiring
 * more than 1.0Mb storage and more than 2 hours to complete.
 */

#define EVENTS  200000          /* Events per simulation */
#define FEVENTS 200000.0        /* floating point representation */
FILE  *f1;                      /* file pointers */
int   density[16];              /* for density function */
                                /* Gould Firebreather 9080 Compiler */
                                /* needs the NOBASE because */
                                /* it needs to use a different */
                                /* addressing mode to handle large */
/* NOBASE */                    /* arrays. */
float timelist[100+EVENTS];     /* list of event times */
int   pqueuesize, mqueuesize;   /* present and maximum queue sizes */
int   rqueuesize;               /* total jobs stored in queues */
float idletime;                 /* idletime for second level */
float r1sptime;                 /* response time for first level */
float resptime;                 /* response time for second level */
float rrsptime;                 /* cumulative time for second level */
float sysresptime;              /* system response time */
float synchresptime;            /* synchronous system response time */
float pcnt[] = { 1.00,          /* ratio of */
                 0.95, 0.90,    /* response time of level 2 / */
                 0.85, 0.80,    /* response time of level 1 */
                 0.75, 0.70,
                 0.65, 0.60,
                 0.55, 0.50,
                 0.0 };         /* 0.0 ends the list (loop test in main) */

void statqueue(int pos, float prestime, float proctime);
int  gauss(void);

int main(void)
{   float ktime;                /* time keeper */
    float rtime;                /* response time keeper */
    int   eventtime;            /* event duration */
    int   i;                    /* event counter */
    int   j;                    /* index into pcnt[] */

    f1 = fopen("simulation.results.gauss", "w");    /* open files */
    srandom(getpid());          /* initialise random number generator */
    if (f1 == NULL)
        exit(1);
    for (j = 0; pcnt[j] > 0.0; j++)
    {   pqueuesize = 0;         /* initialise all statistical keepers */
        mqueuesize = 0;
        rqueuesize = 0;
        idletime = 0.0;
        r1sptime = 0.0;
        resptime = 0.0;
        rrsptime = 0.0;
        sysresptime = 0.0;
        for (i = 0; i < 16; i++)
            density[i] = 0;
        ktime = 8.0/16.0;
        for (i = 0; i < 100+EVENTS; i++)        /* generate level 1 event list */
        {   timelist[i] = ktime;
            ktime += 0.5;
        }
        r1sptime = ktime;
        synchresptime = 0.0;
        ktime = 0.0;
        rrsptime = 0.0;
        for (i = 0; i < EVENTS; i++)
        {   eventtime = gauss();                /* random number between */
                                                /* 0 and 15 */
            density[eventtime]++;
            eventtime += 1;
            rtime = pcnt[j]/16.0*(float) eventtime;
            rrsptime += rtime;
            if (rtime < (8.0/16.0))             /* synchronous cycle is set */
                synchresptime += 8.0/16.0;      /* by the slower level */
            else
                synchresptime += rtime;
            statqueue(i+1, ktime, rtime);       /* queue size */
            if (ktime < timelist[i])
            {   idletime += timelist[i] - ktime;
                ktime = timelist[i] + rtime;
            }
            else
                ktime += rtime;
            resptime += ktime - timelist[i];
            if (i > 1)
            {   sysresptime += rtime + timelist[i] - timelist[i-1];
            }
        }
        fprintf(f1, "\f\n\n\n System simulation\n\n");
        fprintf(f1, " Sample set size: %6d\n", EVENTS);
        fprintf(f1, " Processing time of level 2 (times level 1): %6.2f\n",
                pcnt[j]);
        fprintf(f1, " Average processing time (level 1): %6.2f\n",
                r1sptime/(float)(100+EVENTS));
        fprintf(f1, " Average processing time (level 2): %6.2f\n",
                rrsptime/FEVENTS);
        fprintf(f1, "\n Asynchronous system statistics\n");
        fprintf(f1, " Average size of level-level queue: %6.2f\n",
                (float) rqueuesize/FEVENTS);
        fprintf(f1, " Maximum size of level-level queue: %6d\n",
                mqueuesize);
        fprintf(f1, " Average response time (level 2): %6.2f\n",
                resptime/FEVENTS);
        fprintf(f1, " Average system response time (times t1): %6.2f\n",
                ((r1sptime/(FEVENTS+100.0))+(resptime/FEVENTS))/
                (r1sptime/(float)(100+EVENTS)));
        fprintf(f1, " Approximate Percent Idle time (level 2) %6.2f\n",
                100.0*idletime/timelist[EVENTS]);
        fprintf(f1, "\n Synchronous system statistics\n");
        fprintf(f1, " Synchronous SRT(DB) (x t1): %6.2f\n",
                (2*synchresptime/FEVENTS)/
                (r1sptime/(float)(100+EVENTS)));
        fprintf(f1, " Synchronous SRT(TB) (x t1): %6.2f\n",
                (3*synchresptime/FEVENTS)/
                (r1sptime/(float)(100+EVENTS)));
        fprintf(f1, " Synchronous ST(/ t1): %6.2f\n",
                1.0/((synchresptime/FEVENTS)/
                (r1sptime/(float)(100+EVENTS))));
        fprintf(f1, "\n\n Distribution of level 2 response times:\n");
        for (i = 0; i < 8; i++)
            fprintf(f1, "%10d - %6d %2d - %6d\n",
                    i+1, density[i], i+9, density[i+8]);
        fflush(f1);
    }
    return 0;
}

/* Count the jobs, from position pos onward, that arrive before the job */
/* started at prestime finishes; update the queue statistics.           */
void statqueue(int pos, float prestime, float proctime)
{   int j;

    pqueuesize = 0;
    for (j = pos; ((j < (100+EVENTS)) && (timelist[j] < (prestime+proctime))); j++)
        pqueuesize++;
    if (pqueuesize > mqueuesize)
        mqueuesize = pqueuesize;
    rqueuesize += pqueuesize;
}

/* generate gaussian rv using central limit theorem */
int gauss(void)
{   register int i, j;

    j = 0;
    for (i = 0; i < 20; i++)
        j += 017 & random();    /* sum of 20 uniform values in 0..15 */
    return (j/20);              /* integer average, still in 0..15 */
}
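
The central limit construction used for the Gaussian case can be tried in isolation. The sketch below is an illustration, not part of the original listing: the function names are hypothetical, and a small linear congruential generator stands in for random() so that the sequence is reproducible from run to run.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for random(): a 32-bit linear congruential
 * generator (constants from Numerical Recipes), used here only so that
 * the sketch is deterministic. */
static uint32_t lcg_state = 1u;

static uint32_t lcg_next(void)
{
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return lcg_state >> 16;     /* keep the better-behaved high bits */
}

/* Approximately Gaussian integer in 0..15: the integer average of 20
 * uniform draws from 0..15, as in the central limit approach above. */
int gauss_clt(void)
{
    int sum = 0;
    for (int i = 0; i < 20; i++)
        sum += (int)(lcg_next() & 017u);
    return sum / 20;
}
```

Because the average is truncated to an integer, the values cluster around 7 rather than 7.5, which matches the shape of the distributions printed with the Gaussian results, where the bin labelled 8 (value 7) is the most heavily populated.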




Results  of  Asynchronous  System  Simulation 
Uniform  Distribution  (Both  Levels) 


System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 1.00
Average processing time (level 1): 0.50
Average processing time (level 2): 0.53

Asynchronous system statistics
Average size of level-level queue: 5764.08
Maximum size of level-level queue: 11596
Average response time (level 2): 3056.16
Average system response time (times t1): 6113.30
Approximate Percent Idle time (level 2) 0.00

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.56
Synchronous SRT(TB) (x t1): 3.84
Synchronous ST(/ t1): 0.78

Distribution of level 2 response times:
     1 -  12479      9 -  12581
     2 -  12480     10 -  12640
     3 -  12608     11 -  12221
     4 -  12499     12 -  12600
     5 -  12579     13 -  12332
     6 -  12506     14 -  12474
     7 -  12592     15 -  12562
     8 -  12453     16 -  12394

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.95
Average processing time (level 1): 0.50
Average processing time (level 2): 0.51

Asynchronous system statistics
Average size of level-level queue: 976.24
Maximum size of level-level queue: 2230
Average response time (level 2): 493.88
Average system response time (times t1): 988.76
Approximate Percent Idle time (level 2) 0.00

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.48
Synchronous SRT(TB) (x t1): 3.72
Synchronous ST(/ t1): 0.81

Distribution of level 2 response times:
     1 -  12623      9 -  12424
     2 -  12496     10 -  12437
     3 -  12568     11 -  12467
     4 -  12538     12 -  12302
     5 -  12392     13 -  12479
     6 -  12515     14 -  12504
     7 -  12578     15 -  12579
     8 -  12478     16 -  12620

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.90
Average processing time (level 1): 0.50
Average processing time (level 2): 0.48

Asynchronous system statistics
Average size of level-level queue: 3.49
Maximum size of level-level queue: 30
Average response time (level 2): 2.02
Average system response time (times t1): 5.04
Approximate Percent Idle time (level 2) 4.13

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.41
Synchronous SRT(TB) (x t1): 3.62
Synchronous ST(/ t1): 0.83

Distribution of level 2 response times:
     1 -  12606      9 -  12614
     2 -  12606     10 -  12404
     3 -  12365     11 -  12362
     4 -  12298     12 -  12441
     5 -  12633     13 -  12349
     6 -  12568     14 -  12501
     7 -  12601     15 -  12307
     8 -  12774     16 -  12571

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.85
Average processing time (level 1): 0.50
Average processing time (level 2): 0.45

Asynchronous system statistics
Average size of level-level queue: 1.42
Maximum size of level-level queue: 16
Average response time (level 2): 0.99
Average system response time (times t1): 2.99
Approximate Percent Idle time (level 2) 9.26

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.34
Synchronous SRT(TB) (x t1): 3.51
Synchronous ST(/ t1): 0.86

Distribution of level 2 response times:
     1 -  12458      9 -  12590
     2 -  12573     10 -  12617
     3 -  12504     11 -  12567
     4 -  12363     12 -  12488
     5 -  12369     13 -  12444
     6 -  12643     14 -  12608
     7 -  12475     15 -  12464
     8 -  12385     16 -  12452

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.80
Average processing time (level 1): 0.50
Average processing time (level 2): 0.42

Asynchronous system statistics
Average size of level-level queue: 0.73
Maximum size of level-level queue: 7
Average response time (level 2): 0.67
Average system response time (times t1): 2.33
Approximate Percent Idle time (level 2) 15.11

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.26
Synchronous SRT(TB) (x t1): 3.39
Synchronous ST(/ t1): 0.88

Distribution of level 2 response times:
     1 -  12715      9 -  12579
     2 -  12334     10 -  12442
     3 -  12676     11 -  12404
     4 -  12388     12 -  12504
     5 -  12591     13 -  12413
     6 -  12486     14 -  12402
     7 -  12701     15 -  12367
     8 -  12470     16 -  12528

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.75
Average processing time (level 1): 0.50
Average processing time (level 2): 0.40

Asynchronous system statistics
Average size of level-level queue: 0.47
Maximum size of level-level queue: 6
Average response time (level 2): 0.54
Average system response time (times t1): 2.08
Approximate Percent Idle time (level 2) 19.79

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.20
Synchronous SRT(TB) (x t1): 3.30
Synchronous ST(/ t1): 0.91

Distribution of level 2 response times:
     1 -  12528      9 -  12299
     2 -  12488     10 -  12402
     3 -  12461     11 -  12442
     4 -  12530     12 -  12627
     5 -  12657     13 -  12592
     6 -  12552     14 -  12639
     7 -  12518     15 -  12303

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.70
Average processing time (level 1): 0.50
Average processing time (level 2): 0.37

Asynchronous system statistics
Average size of level-level queue: 0.29
Maximum size of level-level queue: 4
Average response time (level 2): 0.45
Average system response time (times t1): 1.89
Approximate Percent Idle time (level 2) 25.40

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.14
Synchronous SRT(TB) (x t1): 3.21
Synchronous ST(/ t1): 0.93

Distribution of level 2 response times:
     1 -  12427      9 -  12345
     2 -  12610     10 -  12499
     3 -  12508     11 -  12433
     4 -  12517     12 -  12447
     5 -  12551     13 -  12642
     6 -  12484     14 -  12478
     7 -  12485     15 -  12499
     8 -  12448     16 -  12627

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.65
Average processing time (level 1): 0.50
Average processing time (level 2): 0.34

Asynchronous system statistics
Average size of level-level queue: 0.17
Maximum size of level-level queue: 2
Average response time (level 2): 0.38
Average system response time (times t1): 1.76
Approximate Percent Idle time (level 2) 30.85

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.08
Synchronous SRT(TB) (x t1): 3.13
Synchronous ST(/ t1): 0.96

Distribution of level 2 response times:
     1 -  12503      9 -  12439
     2 -  12736     10 -  12485
     3 -  12482     11 -  12472
     4 -  12569     12 -  12590
     5 -  12294     13 -  12490
     6 -  12422     14 -  12479
     7 -  12698     15 -  12327
     8 -  12458     16 -  12556


System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.60
Average processing time (level 1): 0.50
Average processing time (level 2): 0.32

Asynchronous system statistics
Average size of level-level queue: 0.09
Maximum size of level-level queue: 1
Average response time (level 2): 0.34
Average system response time (times t1): 1.67
Approximate Percent Idle time (level 2) 36.08

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.05
Synchronous SRT(TB) (x t1): 3.07
Synchronous ST(/ t1): 0.98

Distribution of level 2 response times:
     1 -  12414      9 -  12525
     2 -  12558     10 -  12522
     3 -  12514     11 -  12353
     4 -  12525     12 -  12470
     5 -  12521     13 -  12583
     6 -  12388     14 -  12300
     7 -  12469     15 -  12738
     8 -  12530     16 -  12590

System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.55
Average processing time (level 1): 0.50
Average processing time (level 2): 0.29

Asynchronous system statistics
Average size of level-level queue: 0.03
Maximum size of level-level queue: 1
Average response time (level 2): 0.30
Average system response time (times t1): 1.60
Approximate Percent Idle time (level 2) 41.29

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.02
Synchronous SRT(TB) (x t1): 3.02
Synchronous ST(/ t1): 0.99

Distribution of level 2 response times:
     1 -  12715      9 -  12467
     2 -  12387     10 -  12420
     3 -  12443     11 -  12800
     4 -  12518     12 -  12424
     5 -  12543     13 -  12510
     6 -  12420     14 -  12515
     7 -  12615     15 -  12493
     8 -  12421     16 -  12309


System simulation

Sample set size: 200000
Processing time of level 2 (times level 1): 0.50
Average processing time (level 1): 0.50
Average processing time (level 2): 0.27

Asynchronous system statistics
Average size of level-level queue: 0.00
Maximum size of level-level queue: 0
Average response time (level 2): 0.27
Average system response time (times t1): 1.54
Approximate Percent Idle time (level 2) 45.85

Synchronous system statistics
Synchronous SRT(DB) (x t1): 2.00
Synchronous SRT(TB) (x t1): 3.00
Synchronous ST(/ t1): 1.00

Distribution of level 2 response times:
     1 -  12431      9 -  12626
     2 -  12573     10 -  12516
     3 -  12501     11 -  12449
     4 -  12542     12 -  12599
     5 -  12570     13 -  12414
     6 -  12438     14 -  12152
     7 -  12574     15 -  12523
     8 -  12538     16 -  12554


Results  of  Simulation 
Gaussian  Distribution 


System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     1.00
Average processing time (level 1):              0.50
Average processing time (level 2):              0.50

Asynchronous system statistics
Average size of level-level queue:            253.73
Maximum size of level-level queue:               520
Average response time (level 2):              127.36
Average system response time (times t1):      255.71
Approximate Percent Idle time (level 2):        0.00

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.10
Synchronous SRT(TB) (x t1):                     3.15
Synchronous ST(/ t1):                           0.95

Distribution of level 2 response times:

 1 -      0       9 -  49018
 2 -      0      10 -  13789
 3 -      0      11 -   1476
 4 -     34      12 -     56
 5 -   1307      13 -      2
 6 -  12841      14 -      0
 7 -  47401      15 -      0
 8 -  74076      16 -      0


System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.95
Average processing time (level 1):              0.50
Average processing time (level 2):              0.48

Asynchronous system statistics
Average size of level-level queue:             36.45
Maximum size of level-level queue:               201
Average response time (level 2):               18.52
Average system response time (times t1):       38.03
Approximate Percent Idle time (level 2):        3.03

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.08
Synchronous SRT(TB) (x t1):                     3.12
Synchronous ST(/ t1):                           0.96

Distribution of level 2 response times:

 1 -      0       9 -  49526
 2 -      0      10 -  13784
 3 -      0      11 -   1521
 4 -     45      12 -     63
 5 -   1302      13 -      0
 6 -  12442      14 -      0
 7 -  47324      15 -      0
 8 -  73993      16 -      0

System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.80
Average processing time (level 1):              0.50
Average processing time (level 2):              0.40

Asynchronous system statistics
Average size of level-level queue:              0.00
Maximum size of level-level queue:                 1
Average response time (level 2):                0.40
Average system response time (times t1):        1.80
Approximate Percent Idle time (level 2):       20.18

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.00
Synchronous SRT(TB) (x t1):                     3.00
Synchronous ST(/ t1):                           1.00

Distribution of level 2 response times:

 1 -      0       9 -  49437
 2 -      0      10 -  13690
 3 -      1      11 -   1501
 4 -     39      12 -     59
 5 -   1249      13 -      0
 6 -  12583      14 -      0
 7 -  47684      15 -      0
 8 -  73757      16 -      0

System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.75
Average processing time (level 1):              0.50
Average processing time (level 2):              0.38

Asynchronous system statistics
Average size of level-level queue:              0.00
Maximum size of level-level queue:                 1
Average response time (level 2):                0.38
Average system response time (times t1):        1.76
Approximate Percent Idle time (level 2):       24.48

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.00
Synchronous SRT(TB) (x t1):                     3.00
Synchronous ST(/ t1):                           1.00

Distribution of level 2 response times:

 1 -      0       9 -  49183
 2 -      0      10 -  13767
 3 -      1      11 -   1545
 4 -     48      12 -     39
 5 -   1338      13 -      0
 6 -  12661      14 -      0
 7 -  47178      15 -      0
 8 -  74240      16 -      0

System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.70
Average processing time (level 1):              0.50
Average processing time (level 2):              0.35

Asynchronous system statistics
Average size of level-level queue:              0.00
Maximum size of level-level queue:                 1
Average response time (level 2):                0.35
Average system response time (times t1):        1.71
Approximate Percent Idle time (level 2):       29.39

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.00
Synchronous SRT(TB) (x t1):                     3.00
Synchronous ST(/ t1):                           1.00

Distribution of level 2 response times:

 1 -      0       9 -  48985
 2 -      0      10 -  13707
 3 -      0      11 -   1447
 4 -     49      12 -     50
 5 -   1231      13 -      0
 6 -  12670      14 -      0
 7 -  47641      15 -      0
 8 -  74220      16 -      0


System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.65
Average processing time (level 1):              0.50
Average processing time (level 2):              0.33

Asynchronous system statistics
Average size of level-level queue:              0.00
Maximum size of level-level queue:                 0
Average response time (level 2):                0.33
Average system response time (times t1):        1.66
Approximate Percent Idle time (level 2):       34.31

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.00
Synchronous SRT(TB) (x t1):                     3.00
Synchronous ST(/ t1):                           1.00

Distribution of level 2 response times:

 1 -      0       9 -  49488
 2 -      0      10 -  13743
 3 -      0      11 -   1526
 4 -     44      12 -     61
 5 -   1323      13 -      0
 6 -  12660      14 -      0
 7 -  47239      15 -      0
 8 -  73916      16 -      0

System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.60
Average processing time (level 1):              0.50
Average processing time (level 2):              0.30

Asynchronous system statistics
Average size of level-level queue:              0.00
Maximum size of level-level queue:                 0
Average response time (level 2):                0.30
Average system response time (times t1):        1.60
Approximate Percent Idle time (level 2):       40.02

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.00
Synchronous SRT(TB) (x t1):                     3.00
Synchronous ST(/ t1):                           1.00

Distribution of level 2 response times:

 1 -      0       9 -  49150
 2 -      0      10 -  13856
 3 -      1      11 -   1537
 4 -     46      12 -     54
 5 -   1281      13 -      0
 6 -  12724      14 -      0
 7 -  47220      15 -      0
 8 -  74131      16 -      0


System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.55
Average processing time (level 1):              0.50
Average processing time (level 2):              0.28

Asynchronous system statistics
Average size of level-level queue:              0.00
Maximum size of level-level queue:                 0
Average response time (level 2):                0.27
Average system response time (times t1):        1.55
Approximate Percent Idle time (level 2):       45.23

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.00
Synchronous SRT(TB) (x t1):                     3.00
Synchronous ST(/ t1):                           1.00

Distribution of level 2 response times:

 1 -      0       9 -  49458
 2 -      0      10 -  13845
 3 -      0      11 -   1499
 4 -     45      12 -     59
 5 -   1244      13 -      0
 6 -  12714      14 -      0
 7 -  47474      15 -      0
 8 -  73662      16 -      0


System simulation

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.50
Average processing time (level 1):              0.50
Average processing time (level 2):              0.25

Asynchronous system statistics
Average size of level-level queue:              0.00
Maximum size of level-level queue:                 0
Average response time (level 2):                0.26
Average system response time (times t1):        1.51
Approximate Percent Idle time (level 2):       48.77

Synchronous system statistics
Synchronous SRT(DB) (x t1):                     2.00
Synchronous SRT(TB) (x t1):                     3.00
Synchronous ST(/ t1):                           1.00

Distribution of level 2 response times:

 1 -      0       9 -  49421
 2 -      0      10 -  13776
 3 -      0      11 -   1568
 4 -     40      12 -     47
 5 -   1322      13 -      2
 6 -  12726      14 -      0
 7 -  46966      15 -      0
 8 -  74132      16 -      0

SIMULATOR  LISTINGS 


Asynchronous  System  Simulator  -- 
Both  Levels  Uniform  Random  Variables 


#include <stdio.h>

/*                                                                      */
/* This simulator was written for a Gould Powernode 9080 running UNIX   */
/* 4.2 BSD.  It simulates the execution of an                           */
/* asynchronous two level system, keeping track of the level-to-level   */
/* buffersizes and the response times of the second level.  For this    */
/* simulator, it is assumed that the distribution describing the        */
/* response time of both levels is the uniform distribution function,   */
/* and it is run on a total of 200,000 data sets per simulation         */
/* statistic, requiring more than 1.0Mb storage and more than 2 hours   */
/* to complete.                                                         */
/*                                                                      */

#define EVENTS  200000              /* Events per simulation             */
#define FEVENTS 200000.0            /* floating point representation     */
FILE  *f1;                          /* file pointers                     */
int   density[16];                  /* for density function              */
                                    /* Gould Firebreather 9080 Compiler  */
                                    /* needs the NOBASE because          */
                                    /* it needs to use a different       */
                                    /* addressing mode to handle large   */
                                    /* arrays                            */
/* NOBASE */
float timelist[100+EVENTS];         /* list of event times               */
int   pqueuesize, mqueuesize;       /* present and maximum queuesizes    */
int   rqueuesize;                   /* total jobs stored in queues       */
float idletime;                     /* idletime for second level         */
float r1sptime;                     /* response time for first level     */
float resptime;                     /* response time for second level    */
float rrsptime;                     /* cumulative time for second level  */
float sysresptime;                  /* system response time              */

float pcnt[] = { 1.00,              /* ratio of                          */
                 0.95, 0.90,        /* response time of level 2          */
                 0.85, 0.80,        /* to                                */
                 0.75, 0.70,        /* response time of level 1          */
                 0.65, 0.60,
                 0.55, 0.50,
                 0.00 };

main()
{  float ktime;                     /* time keeper                       */
   float rtime;                     /* response time keeper              */
   int   eventtime;                 /* event duration                    */
   int   i;                         /* event counter                     */
   int   j;                         /* event counter                     */
   int   sum;                       /* gaussian only                     */

   f1 = fopen("simulation.results", "w");   /* open files                */
   srandom(getpid());               /* initialize random number generator */
   if (f1 == NULL)
      exit();
   for (j = 0; pcnt[j] > 0.0; j++)
   {  pqueuesize = 0;               /* initialize all statistical keepers */
      mqueuesize = 0;
      rqueuesize = 0;
      idletime = 0.0;
      r1sptime = 0.0;
      resptime = 0.0;
      rrsptime = 0.0;
      sysresptime = 0.0;
      for (i = 0; i < 16; i++)
         density[i] = 0;
      ktime = 8.0/16.0;
      timelist[0] = (float)((1 + (017 & random()))) / 16.0;
      for (i = 1; i < 100+EVENTS; i++)      /* generate random events    */
      {                                     /* 1-16                      */
         timelist[i] = timelist[i-1] +
            ((float)(1 + (017 & random()))) / 16.0;
      }
      r1sptime = timelist[EVENTS-1];
      ktime = 0.0;                          /* present time              */
      rrsptime = 0.0;
      for (i = 0; i < EVENTS; i++)
      {  eventtime = 017 & random();        /* random number between     */
                                            /* 0 and 15                  */
         density[eventtime]++;
         eventtime += 1;
         rtime = pcnt[j]/16.0*(float) eventtime;
         rrsptime += rtime;
         statqueue(i+1, ktime, rtime);      /* queuesize                 */
         if (ktime < timelist[i])           /* update present time       */
         {  idletime += timelist[i] - ktime;
            ktime = timelist[i] + rtime;
         }
         else
            ktime += rtime;
         resptime += ktime - timelist[i];
      }
      fprintf(f1, "System simulation\n");
      fprintf(f1, "Sample set size:                            %6d\n",
              EVENTS);
      fprintf(f1, "Processing time of level 2 (times level 1):%6.2f\n",
              pcnt[j]);
      fprintf(f1, "Average processing time (level 1):         %6.2f\n",
              r1sptime/(FEVENTS-1.0));
      fprintf(f1, "Average processing time (level 2):         %6.2f\n",
              rrsptime/FEVENTS);
      fprintf(f1, "Average size of level-level queue:         %6.2f\n",
              (float) rqueuesize/FEVENTS);
      fprintf(f1, "Maximum size of level-level queue:         %6d\n",
              mqueuesize);
      fprintf(f1, "Average system response time (times t1):   %6.2f\n",
              1.0+(rrsptime/FEVENTS)/(r1sptime/FEVENTS)
              +((rqueuesize/FEVENTS)*
              (rrsptime/FEVENTS)/(r1sptime/(FEVENTS-1.0))));
      fprintf(f1, "Approximate Percent Idle time (level 2):   %6.2f\n",
              100.0*idletime/timelist[EVENTS]);
      fprintf(f1, " Distribution of level 2 response times\n");
      for (i = 0; i < 8; i++)
         fprintf(f1, "%10d - %6d      %2d - %6d\n",
                 i+1, density[i], i+9, density[i+8]);
      fflush(f1);
   }
}

statqueue(pos, prestime, proctime)
int   pos;
float prestime, proctime;
{  int j;

   pqueuesize = 0;
   for (j = pos; ((j < (100+EVENTS)) && (timelist[j] < (prestime+proctime))); j++)
      pqueuesize++;
   if (pqueuesize > mqueuesize)
      mqueuesize = pqueuesize;
   rqueuesize += pqueuesize;
}
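
The bookkeeping in the listing above can be isolated from the random number generation. The following is a minimal sketch, not the author's code: the names Level2Stats and run_level2 are hypothetical, the arrival and service times are passed in as arrays, and the queue count deliberately mirrors statqueue (later arrivals are compared against the previous finish time plus the current service time, without allowing for any idle wait).

```c
#include <stddef.h>

/* Hypothetical container for the statistics the listing accumulates. */
typedef struct {
    double idle;       /* time level 2 spent waiting for input          */
    double resp;       /* sum over jobs of (finish time - arrival time) */
    long   queued;     /* sum of the per-job queue counts               */
    long   max_queue;  /* largest single queue count                    */
} Level2Stats;

/* arrival[i] is when level 1 hands job i to level 2 (non-decreasing);
   service[i] is the level-2 processing time of job i.                 */
Level2Stats run_level2(const double *arrival, const double *service, size_t n)
{
    Level2Stats s = { 0.0, 0.0, 0, 0 };
    double now = 0.0;               /* when level 2 finished its last job */
    size_t i, j;

    for (i = 0; i < n; i++) {
        long q = 0;

        /* Same test as statqueue: count later jobs that arrive before
           now + service[i].                                            */
        for (j = i + 1; j < n && arrival[j] < now + service[i]; j++)
            q++;
        if (q > s.max_queue)
            s.max_queue = q;
        s.queued += q;

        if (now < arrival[i]) {     /* level 2 was idle: wait for the job */
            s.idle += arrival[i] - now;
            now = arrival[i] + service[i];
        } else {                    /* job was already waiting            */
            now += service[i];
        }
        s.resp += now - arrival[i];
    }
    return s;
}
```

For arrivals 1, 2, 3, 4 and service times 0.5, 2.5, 0.5, 0.5, tracing this by hand gives an idle time of 1.5, a summed response time of 6.5, a summed queue count of 2, and a maximum queue count of 1.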


Asynchronous  System  Simulator  — 
Both  Levels  Gaussian  Random  Variables 


#include <stdio.h>

/*                                                                      */
/* This simulator was written for a Gould Powernode 9080 running UNIX   */
/* 4.2 BSD.  It simulates the execution of both a synchronous and an    */
/* asynchronous two level system, keeping track of the level-to-level   */
/* buffersizes and the response times of the second level.  For this    */
/* simulator, it is assumed that the distribution describing the        */
/* response time of both levels is the uniform distribution function,   */
/* and it is run on a total of 100,000 data sets per simulation         */
/* statistic, requiring more than 1.0Mb storage and more than 2 hours   */
/* to complete.                                                         */
/*                                                                      */

#define EVENTS  100000              /* Events per simulation             */
#define FEVENTS 100000.0            /* floating point representation     */
FILE  *f1;                          /* file pointers                     */
int   density[16];                  /* for density function              */
                                    /* Gould Firebreather 9080 Compiler  */
                                    /* needs the NOBASE because          */
                                    /* it needs to use a different       */
                                    /* addressing mode to handle large   */
                                    /* arrays.                           */
/* NOBASE */
float timelist[100+EVENTS];         /* list of event times               */
int   pqueuesize, mqueuesize;       /* present and maximum queuesizes    */
int   rqueuesize;                   /* total jobs stored in queues       */
float idletime;                     /* idletime for second level         */
float r1sptime;                     /* response time for first level     */
float resptime;                     /* response time for second level    */
float rrsptime;                     /* cumulative time for second level  */
float sysresptime;                  /* system response time              */

float pcnt[] = { 1.00,              /* ratio of                          */
                 0.95, 0.90,        /* response time of level 2          */
                 0.85, 0.80,        /* to                                */
                 0.75, 0.70,        /* response time of level 1          */
                 0.65, 0.60,
                 0.55, 0.50,
                 0.00 };

main()
{  float ktime;                     /* time keeper                       */
   float rtime;                     /* response time keeper              */
   int   eventtime;                 /* event duration                    */
   int   i;                         /* event counter                     */
   int   j;                         /* event counter                     */
   int   sum;                       /* gaussian only                     */

   f1 = fopen("simulation.results.gauss", "w");   /* open files          */
   srandom(getpid());               /* initialize random number generator */
   if (f1 == NULL)
      exit();
   for (j = 0; pcnt[j] > 0.0; j++)
   {  pqueuesize = 0;               /* initialize all statistical keepers */
      mqueuesize = 0;
      rqueuesize = 0;
      idletime = 0.0;
      r1sptime = 0.0;
      resptime = 0.0;
      rrsptime = 0.0;
      sysresptime = 0.0;
      for (i = 0; i < 16; i++)
         density[i] = 0;
      ktime = 8.0/16.0;
      timelist[0] = (float)((1 + (017 & random()))) / 16.0;
      for (i = 1; i < 100+EVENTS; i++)      /* generate random events    */
      {                                     /* 0-15                      */
         timelist[i] = timelist[i-1] +
            ((float) gauss()) / 16.0;
      }
      r1sptime = timelist[EVENTS-1];
      ktime = 0.0;                          /* present time              */
      rrsptime = 0.0;
      for (i = 0; i < EVENTS; i++)
      {  eventtime = gauss();               /* random number between     */
                                            /* 0 and 15                  */
         density[eventtime]++;
         eventtime += 1;
         rtime = pcnt[j]/16.0*(float) eventtime;
         rrsptime += rtime;
         statqueue(i+1, ktime, rtime);      /* queuesize                 */
         if (ktime < timelist[i])           /* update present time       */
         {  idletime += timelist[i] - ktime;
            ktime = timelist[i] + rtime;
         }
         else
            ktime += rtime;
         resptime += ktime - timelist[i];
      }
      fprintf(f1, "System simulation\n");
      fprintf(f1, "Sample set size:                            %6d\n",
              EVENTS);
      fprintf(f1, "Processing time of level 2 (times level 1):%6.2f\n",
              pcnt[j]);
      fprintf(f1, "Average processing time (level 1):         %6.2f\n",
              r1sptime/(FEVENTS-1.0));
      fprintf(f1, "Average processing time (level 2):         %6.2f\n",
              rrsptime/FEVENTS);
      fprintf(f1, "Average size of level-level queue:         %6.2f\n",
              (float) rqueuesize/FEVENTS);
      fprintf(f1, "Maximum size of level-level queue:         %6d\n",
              mqueuesize);
      fprintf(f1, "Average system response time (times t1):   %6.2f\n",
              1.0+(rrsptime/FEVENTS)/(r1sptime/FEVENTS)
              +((rqueuesize/FEVENTS)*
              (rrsptime/FEVENTS)/(r1sptime/(FEVENTS-1.0))));
      fprintf(f1, "Approximate Percent Idle time (level 2):   %6.2f\n",
              100.0*idletime/timelist[EVENTS]);
      fprintf(f1, " Distribution of level 2 response times\n");
      for (i = 0; i < 8; i++)
         fprintf(f1, "%10d - %6d      %d - %6d\n",
                 i+1, density[i], i+9, density[i+8]);
      fflush(f1);
   }
}

statqueue(pos, prestime, proctime)
int   pos;
float prestime, proctime;
{  int j;

   pqueuesize = 0;
   for (j = pos; ((j < (100+EVENTS)) && (timelist[j] < (prestime+proctime))); j++)
      pqueuesize++;
   if (pqueuesize > mqueuesize)
      mqueuesize = pqueuesize;
   rqueuesize += pqueuesize;
}

gauss()
{  register int i, j;

   j = 0;
   for (i = 0; i < 20; i++)
      j += 017 & random();
   return(j/20);
}
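
The gauss() routine above approximates a Gaussian variate by averaging 20 uniform 4-bit draws, relying on the central limit theorem. A small standalone sketch of the same idea follows. It is an illustration under stated assumptions rather than the thesis code: the names clt_sample and clt_histogram are hypothetical, and the standard rand() (bits 4 to 7) is swapped in for the BSD random() call so that the sketch needs only the C standard library.

```c
#include <stdlib.h>

#define DRAWS 20   /* uniform draws averaged per sample, as in gauss() */
#define BINS  16   /* samples fall in 0..15                            */

/* One approximately Gaussian integer in 0..15: the truncated mean of
   DRAWS uniform 4-bit values.                                         */
int clt_sample(void)
{
    int k, sum = 0;

    for (k = 0; k < DRAWS; k++)
        sum += (rand() >> 4) & 017;   /* assumption: rand() replaces random() */
    return sum / DRAWS;               /* integer division, as in the listing  */
}

/* Fill hist[0..BINS-1] with the counts of n samples. */
void clt_histogram(long n, long hist[BINS])
{
    long i;
    int  b;

    for (b = 0; b < BINS; b++)
        hist[b] = 0;
    for (i = 0; i < n; i++)
        hist[clt_sample()]++;
}
```

Because the sum of 20 draws has mean 150 and the result is truncated, the histogram should peak at value 7, which the listing prints as bin 8. This agrees with the shape of the Gaussian response-time distributions tabulated in this appendix.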


Synchronous  System  Simulator  — 
Both  Levels  Uniform  Random  Variables 


#include <stdio.h>

#define EVENTS  1000000
#define FEVENTS 1000000.0

FILE  *f1;                          /* file pointers                     */
int   density1[16];                 /* for density function level 1      */
int   density2[16];                 /* for density function level 2      */
float sysresptime;                  /* system response time              */
float pcnt[] = { 1.00,              /* ratio of mean values of levels    */
                 0.95, 0.90,
                 0.85, 0.80,
                 0.75, 0.70,
                 0.65, 0.60,
                 0.55, 0.50,
                 -1 };

main()
{  float l1time;                    /* time keeper level 1               */
   float l2time;                    /* time keeper level 2               */
   float t1time;                    /* time keeper level 1               */
   float t2time;                    /* time keeper level 2               */
   int   eventtime1;                /* event duration level 1            */
   int   eventtime2;                /* event duration level 2            */
   int   i;                         /* event counter                     */
   int   j;                         /* pointer to statistical info       */
   float xbar;                      /* for simplicity -- avg resp time of */
                                    /* level 1                           */

   f1 = fopen("dblbuf", "a");       /* open files                        */
   if (f1 == NULL)
      exit();
   l1time = 0.0;
   l2time = 0.0;
   t1time = 0.0;
   t2time = 0.0;
   sysresptime = 0.0;

   srandom(getpid());               /* initialize random number generator */
   for (j = 0; pcnt[j] > 0; j++)
   {  l1time = 0.0;                 /* time keeper level 1               */
      l2time = 0.0;                 /* time keeper level 2               */
      t1time = 0.0;                 /* time keeper level 1               */
      t2time = 0.0;                 /* time keeper level 2               */
      sysresptime = 0.0;
      for (i = 0; i < 16; i++)
         density1[i] = density2[i] = 0;
      for (i = 0; i < EVENTS; i++)
      {  eventtime1 = 017 & random();       /* random number             */
         eventtime2 = 017 & random();       /* random number             */
                                            /* 0 to 15                   */
         density1[eventtime1]++;            /* density fun               */
         density2[eventtime2]++;            /* density fun               */
         l1time = (float) eventtime1 / 16.0;
         l2time = pcnt[j] * (float) eventtime2 / 16.0;
         t1time += l1time;                  /* avg resp time             */
         t2time += l2time;                  /* avg resp time             */
         if (l1time < l2time)
            sysresptime += l2time;
         else
            sysresptime += l1time;
      }
      xbar = t1time/FEVENTS;
      fprintf(f1, "\n");
      fprintf(f1, "    synchronous system\n");
      fprintf(f1, "Sample set size:                            %6d\n",
              EVENTS);
      fprintf(f1, "Processing time of level 2 (times level 1):%6.2f\n",
              pcnt[j]);
      fprintf(f1, "Average processing time (level 1):         %6.2f\n",
              xbar);
      fprintf(f1, "Average processing time (level 2):         %6.2f\n",
              t2time/FEVENTS);
      fprintf(f1, "Average DB system response time (* t1):    %6.2f\n",
              (2.0*sysresptime/FEVENTS)/xbar);
      fprintf(f1, "Average TB system response time (* t1):    %6.2f\n",
              (3.0*sysresptime/FEVENTS)/xbar);
      fprintf(f1, " Distribution of level 1 response times:\n");
      for (i = 0; i < 8; i++)
         fprintf(f1, "%10d - %6d      %6d - %6d\n",
                 i, density1[i], i+8, density1[i+8]);
      fprintf(f1, " Distribution of level 2 response times:\n");
      for (i = 0; i < 8; i++)
         fprintf(f1, "%10.2f - %6d      %8.2f - %6d\n",
                 pcnt[j]*(float) i, density2[i],
                 pcnt[j]*(float)(i+8), density2[i+8]);
   }
}
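
Because both levels in the listing above draw from only 16 equally likely values, the quantity the synchronous simulator estimates can also be computed exactly by enumerating all 256 pairs. The sketch below does that as a cross-check. It is not part of the thesis: the function names are hypothetical, and the multipliers 2 and 3 for the DB and TB figures are simply taken from the fprintf calls in the listing.

```c
/* Exact mean of max(X/16, ratio*Y/16) for independent X and Y, each
   uniform on the integers 0..15 (the per-cycle time of the synchronous
   system as the listing models it).                                   */
double exact_cycle_time(double ratio)
{
    int    x, y;
    double sum = 0.0;

    for (x = 0; x < 16; x++) {
        for (y = 0; y < 16; y++) {
            double l1 = x / 16.0;
            double l2 = ratio * y / 16.0;
            sum += (l1 < l2) ? l2 : l1;   /* slower level sets the pace */
        }
    }
    return sum / 256.0;                   /* 16 * 16 equally likely pairs */
}

/* Response time in units of the mean level-1 time; `buffers` is 2 for
   the DB figure and 3 for the TB figure, following the listing.       */
double exact_sync_srt(double ratio, int buffers)
{
    const double xbar = 7.5 / 16.0;       /* exact mean of X/16          */

    return buffers * exact_cycle_time(ratio) / xbar;
}
```

For a ratio of 1.00 the enumeration gives 2.7083 (DB) and 4.0625 (TB), which round to the 2.71 and 4.06 reported in the synchronous simulation results for the uniform distribution.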


Synchronous  System  Simulator  — 
Both  Levels  Gaussian  Random  Variables 


#include <stdio.h>

#define EVENTS  100000
#define FEVENTS 100000.0
FILE  *f1;                          /* file pointers                     */
int   density1[16];                 /* for density function level 1      */
int   density2[16];                 /* for density function level 2      */
float sysresptime;                  /* system response time              */
float pcnt[] = { 1.00,              /* ratio of mean values of levels    */
                 0.95, 0.90,
                 0.85, 0.80,
                 0.75, 0.70,
                 0.65, 0.60,
                 0.55, 0.50,
                 -1 };

main()
{  float l1time;                    /* time keeper level 1               */
   float l2time;                    /* time keeper level 2               */
   float t1time;                    /* time keeper level 1               */
   float t2time;                    /* time keeper level 2               */
   int   eventtime1;                /* event duration level 1            */
   int   eventtime2;                /* event duration level 2            */
   int   i;                         /* event counter                     */
   int   j;                         /* pointer to statistical info       */
   float xbar;                      /* for simplicity -- avg resp time of */
                                    /* level 1                           */
   int   k;                         /* for generation of a gaussian fun  */
                                    /* using the central limit theorem   */

   f1 = fopen("dblbuf.gauss", "a"); /* open files                        */
   if (f1 == NULL)
      exit();
   l1time = 0.0;
   l2time = 0.0;
   t1time = 0.0;
   t2time = 0.0;
   sysresptime = 0.0;

   srandom(getpid());               /* initialize random number generator */
   for (j = 0; pcnt[j] > 0; j++)
   {  l1time = 0.0;                 /* time keeper level 1               */
      l2time = 0.0;                 /* time keeper level 2               */
      t1time = 0.0;                 /* time keeper level 1               */
      t2time = 0.0;                 /* time keeper level 2               */
      for (i = 0; i < 16; i++)
         density1[i] = density2[i] = 0;
      sysresptime = 0.0;
      for (i = 0; i < EVENTS; i++)
      {  eventtime1 = 0;
         for (k = 0; k < 20; k++)
            eventtime1 += 017 & random();   /* random number             */
         eventtime2 = 0;
         for (k = 0; k < 20; k++)
            eventtime2 += 017 & random();   /* random number             */
                                            /* 0 to 15                   */
         eventtime1 /= 20;                  /* normalization             */
         eventtime2 /= 20;                  /* normalization             */
         density1[eventtime1]++;            /* density fun               */
         density2[eventtime2]++;            /* density fun               */
         l1time = (float) eventtime1 / 16.0;
         l2time = pcnt[j] * (float) eventtime2 / 16.0;
         t1time += l1time;                  /* avg resp time             */
         t2time += l2time;                  /* avg resp time             */
         if (l1time < l2time)
            sysresptime += l2time;
         else
            sysresptime += l1time;
      }
      xbar = t1time/FEVENTS;
      fprintf(f1, "\n");
      fprintf(f1, "    synchronous system\n");
      fprintf(f1, "Sample set size:                            %6d\n",
              EVENTS);
      fprintf(f1, "Processing time of level 2 (times level 1):%6.2f\n",
              pcnt[j]);
      fprintf(f1, "Average processing time (level 1):         %6.2f\n",
              xbar);
      fprintf(f1, "Average processing time (level 2):         %6.2f\n",
              t2time/FEVENTS);
      fprintf(f1, "Average DB system response time (* t1):    %6.2f\n",
              (2.0*sysresptime/FEVENTS)/xbar);
      fprintf(f1, "Average TB system response time (* t1):    %6.2f\n",
              (3.0*sysresptime/FEVENTS)/xbar);
      fprintf(f1, " Distribution of level 1 response times:\n");
      for (i = 0; i < 8; i++)
         fprintf(f1, "%10d - %6d      %6d - %6d\n",
                 i, density1[i], i+8, density1[i+8]);
      fprintf(f1, " Distribution of level 2 response times:\n");
      for (i = 0; i < 8; i++)
         fprintf(f1, "%10.2f - %6d      %8.2f - %6d\n",
                 pcnt[j]*(float) i, density2[i],
                 pcnt[j]*(float)(i+8), density2[i+8]);
   }
}


Results  of  Simulation 
Uniform  Distribution 


Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     1.00
Average processing time (level 1):              0.53
Average processing time (level 2):              0.53
Average size of level-level queue:             62.81
Maximum size of level-level queue:               177
Average system response time (times t1):       64.62
Approximate Percent Idle time (level 2):        0.30

Distribution of level 2 response times:

 1 -  12656       9 -  12426
 2 -  12499      10 -  12245
 3 -  12356      11 -  12446
 4 -  12700      12 -  12438
 5 -  12523      13 -  12664
 6 -  12513      14 -  12548
 7 -  12596      15 -  12557
 8 -  12448      16 -  12385


Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.95
Average processing time (level 1):              0.53
Average processing time (level 2):              0.51
Average size of level-level queue:              6.01
Maximum size of level-level queue:                39
Average system response time (times t1):        7.69
Approximate Percent Idle time (level 2):        4.58

Distribution of level 2 response times:

 1 -  12396       9 -  12541
 2 -  12475      10 -  12281
 3 -  12571      11 -  12637



Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.90
Average processing time (level 1):              0.53
Average processing time (level 2):              0.48
Average size of level-level queue:              ?.87
Maximum size of level-level queue:                24
Average system response time (times t1):        4.58
Approximate Percent Idle time (level 2):        8.57

Distribution of level 2 response times:

 1 -  12628       9 -  12660
 2 -  12766      10 -  12376
 3 -  12360      11 -  12681
 4 -  12286      12 -  12562
 5 -  12460      13 -  12502
 6 -  12438      14 -  12383
 7 -  12380      15 -  12606
 8 -  12666      16 -  12357

Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.85
Average processing time (level 1):              0.53
Average processing time (level 2):              0.45
Average size of level-level queue:              1.76
Maximum size of level-level queue:                30
Average system response time (times t1):        3.34
Approximate Percent Idle time (level 2):       14.88

Distribution of level 2 response times:

 1 -  12623       9 -  12648
 2 -  12348      10 -  12420
 3 -  12544      11 -  12562
 4 -  12540      12 -  12613
 5 -  12274      13 -  12665
 6 -  12527      14 -  12443
 7 -  12507      15 -  12378
 8 -  12686      16 -  12208

Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.80
Average processing time (level 1):              0.53
Average processing time (level 2):              0.43
Average size of level-level queue:              1.22
Maximum size of level-level queue:                13
Average system response time (times t1):        2.78
Approximate Percent Idle time (level 2):       18.63

Distribution of level 2 response times:

 1 -  12590       9 -  12536
 2 -  12320      10 -  12548
 3 -  12349      11 -  12517
 4 -  12514      12 -  12639
 5 -  12452      13 -  12622
 6 -  12427      14 -  12323
 7 -  12540      15 -  12636
 8 -  12371      16 -  12616

Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.75
Average processing time (level 1):              0.53
Average processing time (level 2):              0.40
Average size of level-level queue:              0.88
Maximum size of level-level queue:                12
Average system response time (times t1):        2.41
Approximate Percent Idle time (level 2):       24.66

Distribution of level 2 response times:

 1 -  12579       9 -  12619


Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.70
Average processing time (level 1):              0.53
Average processing time (level 2):              0.37
Average size of level-level queue:              0.67
Maximum size of level-level queue:                10
Average system response time (times t1):        2.17
Approximate Percent Idle time (level 2):       29.63

Distribution of level 2 response times:

 1 -  12265       9 -  12373
 2 -  12506      10 -  12328
 3 -  12435      11 -  12543
 4 -  12437      12 -  12528
 5 -  12560      13 -  12696
 6 -  12626      14 -  12677
 7 -  12504      15 -  12512
 8 -  12468      16 -  12542


Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.65
Average processing time (level 1):              0.53
Average processing time (level 2):              0.35
Average size of level-level queue:              0.51
Maximum size of level-level queue:                 9
Average system response time (times t1):        1.98
Approximate Percent Idle time (level 2):       34.64

Distribution of level 2 response times:

 1 -  12339       9 -  12635
 2 -  12409      10 -  12318
 3 -  12493      11 -  12399
 4 -  12467      12 -  12665
 5 -  12456      13 -  12631
 6 -  12596      14 -  12433
 7 -  12619      15 -  12743
 8 -  12347      16 -  12450



Asynchronous System Simulation -- Results (uniform)

Sample set size:                              200000
Processing time of level 2 (times level 1):     0.60
Average processing time (level 1):              0.53
Average processing time (level 2):              0.32
Average size of level-level queue:              0.38
Maximum size of level-level queue:
Average system response time (times t1):
Approximate Percent Idle time (level 2):

Distribution of level 2 response times:

 1 -  12506       9 -  12391
 2 -  12429      10 -  12380
 3 -  12453      11 -  12482
 4 -  12642      12 -  12359
 5 -  12535      13 -  12655
 6 -  12729      14 -  12476
 7 -  12487      15 -  12473
 8 -  12504      16 -  12499

Asynchronous System Simulation -- Results (uniform)

Sample set size: 200000
Processing time of level 2 (times level 1): 0.55
Average processing time (level 1): 0.53
Average processing time (level 2): 0.29
Average size of level-level queue: 0.28
Maximum size of level-level queue: 7
Average system response time (times t1): 1.71
Approximate Percent Idle time (level 2): 44.67

Distribution of level 2 response times:

 1 - 12512     9 - 12606
 2 - 12568    10 - 12391
 3 - 12451    11 - 12582
 4 - 12568    12 - 12374
 5 - 12453    13 - 12295
 6 - 12348    14 - 12304


Asynchronous System Simulation -- Results (uniform)

Sample set size: 200000
Processing time of level 2 (times level 1): 0.50
Average processing time (level 1): 0.53
Average processing time (level 2): 0.27
Average size of level-level queue: 0.22
Maximum size of level-level queue: 8
Average system response time (times t1): 1.61
Approximate Percent Idle time (level 2): 48.80

Distribution of level 2 response times:

 1 - 12477     9 - 12426
 2 - 12700    10 - 12431
 3 - 12353    11 - 12400
 4 - 12307    12 - 12585
 5 - 12569    13 - 12624
 6 - 12480    14 - 12546

Results of Synchronous Simulation
Uniform  Distribution  (both  levels) 


Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 1.00
Average processing time (level 1): 0.47
Average processing time (level 2): 0.47
Average DB system response time (* t1): 2.71
Average TB system response time (* t1): 4.06

Distribution of level 1 response times:

 0 - 62611     8 - 62431
 1 - 61913     9 - 62431
 2 - 63096    10 - 62631
 3 - 62507    11 - 62470
 4 - 62055    12 - 62663
 5 - 62443    13 - 62532
 6 - 62348    14 - 62681
 7 - 62763    15 - 62425

Distribution of level 2 response times:

 0.00 - 62649     8.00 - 62404
 1.00 - 62391     9.00 - 62432
 2.00 - 62940    10.00 - 62296
 3.00 - 62249    11.00 - 62418
 4.00 - 62664    12.00 - 62761
 5.00 - 62227    13.00 - 62460
 6.00 - 62898    14.00 - 62451
 7.00 - 62640    15.00 - 62120

Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.95
Average processing time (level 1): 0.47
Average processing time (level 2): 0.45
Average DB system response time (* t1): 2.64
Average TB system response time (* t1): 3.96

Distribution of level 1 response times:

 0 - 62729     8 - 62399
 1 - 61969     9 - 62502
 2 - 62659    10 - 62460
 3 - 62611    11 - 62471
 4 - 62624    12 - 62204
 5 - 62544    13 - 62323
 6 - 62803    14 - 62397
 7 - 62507    15 - 62798

Distribution of level 2 response times:

 0.00 - 62583     7.60 - 62943
 0.95 - 62580     8.55 - 62715
 1.90 - 62783     9.50 - 62573
 2.85 - 62311    10.45 - 62474
 3.80 - 62338    11.40 - 62521
 4.75 - 62128    12.35 - 62309
 5.70 - 62499    13.30 - 62339
 6.65 - 62205    14.25 - 62699

Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.80
Average processing time (level 1): 0.47
Average processing time (level 2): 0.37
Average DB system response time (* t1): 2.46
Average TB system response time (* t1): 3.69

Distribution of level 1 response times:

 0 - 62454     8 - 62272
 1 - 62655     9 - 62635
 2 - 62566    10 - 62962
 3 - 62384    11 - 62366
 4 - 62621    12 - 62582
 5 - 62335    13 - 62383
 6 - 62344    14 - 62251
 7 - 62599    15 - 62591

Distribution of level 2 response times:

 0.00 - 62720     6.40 - 62582
 0.80 - 62034     7.20 - 62221
 1.60 - 62451     8.00 - 62485
 2.40 - 62340     8.80 - 62636
 3.20 - 62784     9.60 - 62649
 4.00 - 62567    10.40 - 62375
 4.80 - 62931    11.20 - 62231
 5.60 - 62688    12.00 - 62306
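
In the synchronous listings, the level 2 bin labels look like the integer bin indices 0 to 15 scaled by the level 2 processing-time ratio (for the ratio 0.80 they run 0.00, 0.80, ..., 12.00). The following is only a sketch under that assumption, useful for checking labels that were damaged in the scan; the function name `level2_bin_labels` is hypothetical and is not part of the original simulation program.

```python
def level2_bin_labels(ratio, bins=16):
    """Return level 2 bin labels, ASSUMING label k equals k * ratio.

    `ratio` is the "Processing time of level 2 (times level 1)" value
    printed in the header of each synchronous listing.
    """
    # Format to two decimals, matching the way the listings print labels.
    return [f"{k * ratio:.2f}" for k in range(bins)]


if __name__ == "__main__":
    labels = level2_bin_labels(0.80)
    # Print the labels as two columns of eight, like the listings.
    for left, right in zip(labels[:8], labels[8:]):
        print(f"{left:>6}    {right:>6}")
```

A label in a listing that does not appear in this sequence for the stated ratio is most likely a scanning error rather than a different bin.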

Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.75
Average processing time (level 1): 0.47
Average processing time (level 2): 0.36
Average DB system response time (* t1): 2.42
Average TB system response time (* t1): 3.63

Distribution of level 1 response times:

 0 - 62331     8 - 62582
 1 - 62942     9 - 62269
 2 - 62886    10 - 62372
 3 - 62392    11 - 62994
 4 - 62057    12 - 62223
 5 - 62419    13 - 62382
 6 - 62594    14 - 62471
 7 - 62848    15 - 62238

Distribution of level 2 response times:

 0.00 - 62128     6.00 - 62473
 0.75 - 62267     6.75 - 62158
 1.50 - 62201     7.50 - 63099
 2.25 - 62393     8.25 - 62702
 3.00 - 62795     9.00 - 62648
 3.75 - 61868     9.75 - 62474
 4.50 - 62857    10.50 - 62456
 5.25 - 62884    11.25 - 62597


Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.70
Average processing time (level 1): 0.47
Average processing time (level 2): 0.33
Average DB system response time (* t1): 2.37
Average TB system response time (* t1): 3.55

Distribution of level 1 response times:

 0 - 62515     8 - 62226
 1 - 62257     9 - 62664
 2 - 62571    10 - 62478
 3 - 62362    11 - 62714
 4 - 61982    12 - 62357
 5 - 62765    13 - 62692
 6 - 62725    14 - 62722
 7 - 62889    15 - 62081

Distribution of level 2 response times:

 0.00 - 62576     5.60 - 62441
 0.70 - 62634     6.30 - 62721
 1.40 - 62677     7.00 - 62600
 2.10 - 62308     7.70 - 62026
 2.80 - 62822     8.40 - 62664
 3.50 - 62174     9.10 - 62390
 4.20 - 62251     9.80 - 62684
 4.90 - 62605    10.50 - 62427

Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.65
Average processing time (level 1): 0.47
Average processing time (level 2): 0.31
Average DB system response time (* t1): 2.32
Average TB system response time (* t1): 3.48

Distribution of level 1 response times:

 0 - 62564     8 - 62357
 1 - 62201     9 - 62561
 2 - 62354    10 - 62669
 3 - 62774    11 - 62677
 4 - 61901    12 - 62567
 5 - 62713    13 - 62744
 6 - 62601    14 - 62409
 7 - 62490    15 - 62418

Distribution of level 2 response times:

 0.00 - 62496     5.20 - 62277
 0.65 - 62778     5.85 - 62153
 1.30 - 62732     6.50 - 62602
 1.95 - 62464     7.15 - 62435
 2.60 - 62456     7.80 - 62618
 3.25 - 62163     8.45 - 62660
 3.90 - 62800     9.10 - 62509
 4.55 - 62666     9.75 - 62191

Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.60
Average processing time (level 1): 0.47
Average processing time (level 2): 0.28
Average DB system response time (* t1): 2.27
Average TB system response time (* t1): 3.41

Distribution of level 1 response times:

 0 - 63046     8 - 62723
 1 - 62426     9 - 62312
 2 - 62174    10 - 62531
 3 - 62471    11 - 62439
 4 - 62266    12 - 62410
 5 - 62683    13 - 62851
 6 - 62791    14 - 62163
 7 - 62684    15 - 62230

Distribution of level 2 response times:

 0.00 - 62734     4.80 - 62556
 0.60 - 62192     5.40 - 62259
 1.20 - 62138     6.00 - 62521
 1.80 - 62455     6.60 - 62235
 2.40 - 62515     7.20 - 63334
 3.00 - 62996     7.80 - 62668
 3.60 - 62689     8.40 - 61948
 4.20 - 62349     9.00 - 62411

Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.55
Average processing time (level 1): 0.47
Average processing time (level 2): 0.26
Average DB system response time (* t1): 2.24
Average TB system response time (* t1): 3.35

Distribution of level 1 response times:

 0 - 62725     8 - 62429
 1 - 62429     9 - 62470
 2 - 63000    10 - 62348
 3 - 62635    11 - 62509
 4 - 62747    12 - 62617
 5 - 62252    13 - 62215
 6 - 62702    14 - 62425
 7 - 62414    15 - 62083

Distribution of level 2 response times:

 0.00 - 62570     4.40 - 62951
 0.55 - 62149     4.95 - 62277
 1.10 - 62603     5.50 - 62386
 1.65 - 62217     6.05 - 62541
 2.20 - 62276     6.60 - 62312
 2.75 - 62464     7.15 - 62526
 3.30 - 62866     7.70 - 63072
 3.85 - 62456     8.25 - 62334

Synchronous System -- Simulation Results

Sample set size: 1000000
Processing time of level 2 (times level 1): 0.50
Average processing time (level 1): 0.47
Average processing time (level 2): 0.25
Average DB system response time (* t1): 2.21
Average TB system response time (* t1): 3.32

Distribution of level 1 response times:

 0 - 62562     8 - 62326
 1 - 62529     9 - 62018
 2 - 62673    10 - 62100
 3 - 62358    11 - 62485
 4 - 63036    12 - 62492
 5 - 62981    13 - 62256
 6 - 62613    14 - 62439
 7 - 62465    15 - 62667

Distribution of level 2 response times:

 0.00 - 62453     4.00 - 62375
 0.50 - 62586     4.50 - 62258
 1.00 - 62399     5.00 - 62501
 1.50 - 63013     5.50 - 62157
 2.00 - 62285     6.00 - 62615
 2.50 - 62390     6.50 - 62515
 3.00 - 62418     7.00 - 62643
 3.50 - 62794     7.50 - 62598

Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.90
Average processing time (level 1): 0.44
Average processing time (level 2): 0.45
Average size of level-level queue: 1432.57
Maximum size of level-level queue: 2801
Average system response time (times t1): 1475.48
Approximate Percent Idle time (level 2): 0.00

Distribution of level 2 response times:

 1 - 0         9 - 24510
 2 - 0        10 - 7019
 3 - 0        11 - 772
 4 - 28       12 - 27
 5 - 630      13 - 1
 6 - 6371     14 - 0
 7 - 23830    15 - 0
 8 - 36812    16 - 0
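
Because every sample falls into exactly one response-time bin, the bin counts of a listing should add up to its sample set size. The sketch below applies that check to the counts of the Gaussian listing above with a processing-time ratio of 0.90; it is offered only as a way to spot digits damaged in scanning, and the names `SAMPLE_SET_SIZE`, `counts`, and `histogram_is_complete` are hypothetical, not taken from the original program.

```python
# Bin counts transcribed from the asynchronous Gaussian listing
# with "Processing time of level 2 (times level 1): 0.90".
SAMPLE_SET_SIZE = 100000
counts = {
    1: 0, 2: 0, 3: 0, 4: 28, 5: 630, 6: 6371, 7: 23830, 8: 36812,
    9: 24510, 10: 7019, 11: 772, 12: 27, 13: 1, 14: 0, 15: 0, 16: 0,
}


def histogram_is_complete(bin_counts, sample_size):
    """Return True when the bin counts account for every sample."""
    return sum(bin_counts.values()) == sample_size


if __name__ == "__main__":
    total = sum(counts.values())
    print("total samples in bins:", total)
    print("matches sample set size:",
          histogram_is_complete(counts, SAMPLE_SET_SIZE))
```

A listing whose counts fall short of the sample set size by a round amount (for example 90 or 900) is a hint that a single digit was misread.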


Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.85
Average processing time (level 1): 0.44
Average processing time (level 2): 0.43
Average size of level-level queue: 1.00
Maximum size of level-level queue: 7
Average system response time (times t1): 2.95
Approximate Percent Idle time (level 2): 2.92

Distribution of level 2 response times:

 1 - 0         9 - 24832
 2 - 0        10 - 6991
 3 - 0        11 - 732
 4 - 29       12 - 29
 5 - 652      13 - 1
 6 - 6272     14 - 0
 7 - 23746    15 - 0

Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.80
Average processing time (level 1): 0.44
Average processing time (level 2): 0.40
Average size of level-level queue: 0.41
Maximum size of level-level queue: 3
Average system response time (times t1): 2.28
Approximate Percent Idle time (level 2): 8.69

Distribution of level 2 response times:

 1 - 0         9 - 24700
 2 - 0        10 - 6847
 3 - 0        11 - 792
 4 - 19       12 - 24
 5 - 650      13 - 0
 6 - 6257     14 - 0
 7 - 23787    15 - 0
 8 - 36924    16 - 0

Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.75
Average processing time (level 1):
Average processing time (level 2):
Average size of level-level queue:
Maximum size of level-level queue:
Average system response time (times t1):
Approximate Percent Idle time (level 2):

Distribution of level 2 response times:

 9 - 24621
10 - 7038
11 - 712
12 - 21
13 - 0
14 - 0
15 - 0
16 - 0

Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.70
Average processing time (level 1): 0.44
Average processing time (level 2): 0.35
Average size of level-level queue: 0.07
Maximum size of level-level queue: 1
Average system response time (times t1): 1.86
Approximate Percent Idle time (level 2): 20.00

Distribution of level 2 response times:

 1 - 0         9 - 24799
 2 - 0        10 - 6774
 3 - 0        11 - 750
 4 - 24       12 - 34
 5 - 640      13 - 1
 6 - 6340     14 - 0
 7 - 23759    15 - 0
 8 - 36870    16 - 0


Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.65
Average processing time (level 1): 0.44
Average processing time (level 2): 0.33
Average size of level-level queue: 0.03
Maximum size of level-level queue: 1
Average system response time (times t1): 1.76
Approximate Percent Idle time (level 2): 25.75

Distribution of level 2 response times:

 1 - 0         9 - 24622
 2 - 0        10 - 6779
 3 - 0        11 - 775


Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.60
Average processing time (level 1): 0.44
Average processing time (level 2): 0.30
Average size of level-level queue: 0.01
Maximum size of level-level queue: 1
Average system response time (times t1): 1.69
Approximate Percent Idle time (level 2): 31.47

Distribution of level 2 response times:

 1 - 0         9 - 24573
 2 - 0        10 - 6950
 3 - 0        11 - 756
 4 - 17       12 - 25
 5 - 632      13 - 0
 6 - 6423     14 - 0
 7 - 23585    15 - 0
 8 - 37039    16 - 0

Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.55
Average processing time (level 1): 0.44
Average processing time (level 2): 0.28
Average size of level-level queue: 0.00
Maximum size of level-level queue: 1
Average system response time (times t1): 1.63
Approximate Percent Idle time (level 2): 37.16

Distribution of level 2 response times:

 1 - 0         9 - 24798
 2 - 0        10 - 6967
 3 - 0        11 - 777
 4 - 23       12 - 33
 5 - 631      13 - 0
 6 - 6226     14 - 0
 7 - 23377    15 - 0
 8 - 37168    16 - 0



Asynchronous System Simulation Results (Gaussian)

Sample set size: 100000
Processing time of level 2 (times level 1): 0.50
Average processing time (level 1): 0.44
Average processing time (level 2): 0.25
Average size of level-level queue: 0.00
Maximum size of level-level queue: 1
Average system response time (times t1): 1.57
Approximate Percent Idle time (level 2): 42.88

Distribution of level 2 response times:

 1 - 0         9 - 24834
 2 - 0        10 - 6753
 3 - 0        11 - 726
 4 - 23       12 - 29
 5 - 634      13 - 0
 6 - 6126     14 - 0
 7 - 23789    15 - 0
 8 - 37086    16 - 0

Results  of  Synchronous  System  Simulation 

Gaussian  Distribution  (both  levels) 

Synchronous Gaussian System -- Simulation Results

Sample set size: 100000
Processing time of level 2 (times level 1): 1.00
Average processing time (level 1): 0.44
Average processing time (level 2): 0.44
Average DB system response time (* t1): 2.17
Average TB system response time (* t1): 3.25

Distribution of level 1 response times:

 0 - 0         8 - 24749
 1 - 0         9 - 6850
 2 - 0        10 - 786
 3 - 20       11 - 32
 4 - 673      12 - 0
 5 - 6335     13 - 0
 6 - 23588    14 - 0
 7 - 36967    15 - 0

Distribution of level 2 response times:

 0.00 - 0         8.00 - 24531
 1.00 - 0         9.00 - 6943
 2.00 - 0        10.00 - 695
 3.00 - 21       11.00 - 27
 4.00 - 651      12.00 - 0
 5.00 - 6251     13.00 - 0
 6.00 - 23641    14.00 - 0
 7.00 - 37240    15.00 - 0
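
The reported average processing time can be compared against the histogram itself. The sketch below takes the level 1 counts of the synchronous Gaussian listing with ratio 1.00 and computes the count-weighted mean bin index divided by the number of bins. Dividing by 16 is an assumption about how the simulation normalizes time, not something stated in the listing, and the names `level1_counts` and `normalized_mean` are hypothetical; under that assumption the result rounds to the 0.44 printed in the header.

```python
# Level 1 bin counts (bins 0..15) from the synchronous Gaussian listing
# with "Processing time of level 2 (times level 1): 1.00".
level1_counts = [0, 0, 0, 20, 673, 6335, 23588, 36967,
                 24749, 6850, 786, 32, 0, 0, 0, 0]


def normalized_mean(bin_counts):
    """Count-weighted mean bin index divided by the number of bins.

    ASSUMPTION: one time unit corresponds to the full range of bins,
    so the mean bin index is scaled by len(bin_counts) (16 here).
    """
    total = sum(bin_counts)
    weighted = sum(index * count for index, count in enumerate(bin_counts))
    return weighted / total / len(bin_counts)


if __name__ == "__main__":
    print(f"normalized mean: {normalized_mean(level1_counts):.2f}")
```

If the assumption holds, the same calculation gives a quick consistency check between a header value and its histogram in the other synchronous listings.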

Synchronous Gaussian System -- Simulation Results

Sample set size: 100000
Processing time of level 2 (times level 1): 0.95
Average processing time (level 1): 0.44
Average processing time (level 2): 0.42
Average DB system response time (* t1): 2.13
Average TB system response time (* t1): 3.19

Distribution of level 1 response times:

 0 - 0         8 - 24966
 1 - 0         9 - 6866
 2 - 0        10 - 682
 3 - 20       11 - 20
 4 - 635      12 - 0
 5 - 6442     13 - 0
 6 - 23574    14 - 0
 7 - 36795    15 - 0

Distribution of level 2 response times:

 0.00 - 0         7.60 - 24461
 0.95 - 0         8.55 - 7009
 1.90 - 0         9.50 - 774
 2.85 - 28       10.45 - 24
 3.80 - 621      11.40 - 0
 4.75 - 6246     12.35 - 0
 5.70 - 23636    13.30 - 0
 6.65 - 37201    14.25 - 0

Synchronous Gaussian System -- Simulation Results

Sample set size: 100000
Processing time of level 2 (times level 1): 0.90
Average processing time (level 1): 0.44
Average processing time (level 2): 0.40
Average DB system response time (* t1): 2.08
Average TB system response time (* t1): 3.13

Distribution of level 1 response times:

 0 - 0         8 - 24470
 1 - 0         9 - 6030
 2 - 0        10 - 761
 3 - 18       11 - 37
 4 - 605      12 - 0
 5 - 6283     13 - 0
 6 - 23527    14 - 0
 7 - 37270    15 - 0

Distribution of level 2 response times:

 0.00 - 0         7.20 - 24830
 0.90 - 0         8.10 - 6800
 1.80 - 0         9.00 - 754
 2.70 - 28        9.90 - 31
 3.60 - 667      10.80 - 1
 4.50 - 6305     11.70 - 0
 5.40 - 23770    12.60 - 0
 6.30 - 36715    13.50 - 0

Synchronous Gaussian System -- Simulation Results

Sample set size: 100000
Processing time of level 2 (times level 1): 0.85
Average processing time (level 1): 0.44
Average processing time (level 2): 0.37
Average DB system response time (* t1): 2.05
Average TB system response time (* t1): 3.08

Distribution of level 1 response times:

 0 - 0         8 - 24784
 1 - 0         9 - 6885
 2 - 1        10 - 713
 3 - 21       11 - 32
 4 - 686      12 - 0
 5 - 6205     13 - 0
 6 - 23724    14 - 0
 7 - 36850    15 - 0

Distribution of level 2 response times:

 0.00 - 0         6.80 - 24793
 0.85 - 0         7.65 - 6972
 1.70 - 0         8.50 - 730
 2.55 - 20        9.35 - 26
 3.40 - 660      10.20 - 0
 4.25 - 6233     11.05 - 0
 5.10 - 23732    11.90 - 0
 5.95 - 36834    12.75 - 0
0 